Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System
CERTIFICATE
This is to certify that this project titled “Query Pattern Access and Fuzzy Clustering Based
Intrusion Detection System” submitted by Shivam Gupta (2K16/CO/295), Shivam Maini
(2K16/CO/299), Shubham (2K16/CO/309) and Simran Seth (2K16/CO/317) in partial
fulfilment for the requirements for the award of Bachelor of Technology degree in Computer
Engineering (COE) at Delhi Technological University is an authentic work carried out by the
students under my supervision and guidance.
To the best of my knowledge, the matter embodied in the thesis has not been submitted to any
other university or institute for the award of any degree or diploma.
Ms. Indu Singh
(Assistant Professor)
Department of CSE
Delhi Technological University
DECLARATION
We hereby certify that the work presented in the project entitled “Query Pattern
Access and Fuzzy Clustering Based Intrusion Detection System” in fulfilment of the
requirements for the award of the degree of Bachelor of Technology, submitted to the
Department of Computer Engineering, Delhi Technological University, is an authentic record
of our own work, carried out during the period from April 2018 to February 2019 under the
supervision of Ms. Indu Singh (Assistant Professor, CSE Department).
The matter presented in this report has not been submitted by us for the award of any other
degree of this or any other Institute/University.
Shivam Gupta (2K16/CO/295) Shivam Maini (2K16/CO/299)
Shubham (2K16/CO/309) Simran Seth (2K16/CO/317)
ACKNOWLEDGEMENTS
“The successful completion of any task would be incomplete without acknowledging the people
who made it all possible and whose constant guidance and encouragement secured us the
success.”
We owe a debt of gratitude to our guide Ms. Indu Singh (Assistant Professor, CSE
Department) for instilling in us the idea of a creative project, helping us in undertaking
this project, and for being there whenever we needed her assistance.
We also place on record our sense of gratitude to one and all who, directly or indirectly,
have lent their helping hand in this venture.
We feel proud and privileged in expressing our deep sense of gratitude to all those who have
helped us in presenting this project.
Last but not least, we thank our parents for always being with us, in every sense.
PROBLEM STATEMENT
The aim of the project is to build an intrusion detection system that provides the following
functionalities:
- The designed system must be able to detect any anomalous behaviour by any user, raise
an alarm, and take the necessary response against such behaviour.
- The system must be robust to variations in user behaviour.
- It must detect and prevent insider fraud in a credit card company.
- It must provide a higher level of access control for critical data items (like CVV).
- The designed system should be free from the vulnerabilities of outsider attacks such
as session hijacking, session fixation, data theft, etc.
- The system must block all transactions that don’t fall under the user’s jurisdiction by
maintaining user behaviour logs.
MOTIVATION
An Anomaly-Based Intrusion Detection System is a system for detecting computerised
intrusions and misuse by monitoring system activity and classifying it as either normal or
anomalous. The classification is based on heuristics or rules, rather than patterns or signatures,
and will detect any type of misuse that differs significantly from normal system operation.
Earlier IDSs relied on hand-coded rules designed by security experts and network
administrators. However, given the requirements and complexities of today’s network
environments, we need a systematic and automated IDS development process rather than
purely knowledge-based engineering approaches that rely only on intuition and experience.
This encouraged us to study some Data Mining based frameworks for Intrusion Detection.
These frameworks use data mining algorithms to compute activity patterns from system audit
data and extract predictive features from the patterns. Machine learning algorithms are then
applied to the audit records that are processed according to the feature definitions to generate
intrusion detection rules.
The Data Mining based approaches that we have studied can be divided into two main
categories:
1. Supervised Learning
a. Association Rule Mining
2. Unsupervised Learning
a. Clustering
OBJECTIVE
The main purpose of our paper is to monitor user access. Our Intrusion Detection System (IDS)
pays special attention to certain semantically critical data elements along with those elements
which can be used to infer them. We present an innovative approach to combine a user’s
historic and present access pattern, and hence classify the incoming transaction as malicious or
non-malicious. Using Fuzzy C-Means, we partition the users into fuzzy clusters. Each of these
clusters contains a set of rules in their cluster profiles. New transactions are checked in the
detection phase using these clusters. The main advantage of our IDS lies in its ability to prevent
inference attacks on Critical Data Elements and take into account the user’s historic behaviour.
ABSTRACT
Hackers and malicious insiders perpetually try to steal, manipulate and corrupt sensitive data
elements and an organization’s database servers are often the primary targets of these attacks.
In the broadest sense, misuse (witting or unwitting) by authorized database users, database
administrators, or network/systems managers constitutes the insider threat that our project
intends to address. Insider threats are more menacing because, in contrast to outsiders (hackers
or unauthorised users), insiders have authorised access to the database and knowledge of its
critical nuances. Database security involves using a multitude of information security controls
to protect databases against breaches of confidentiality, integrity and availability (CIA).
QPAFCS (Query Pattern Access and Fuzzy Clustering System) involves a plethora of controls:
technical, procedural/administrative and physical. We
hence intend to propose an Intrusion Detection System (IDS) that monitors a database
management system and prevents inference attacks on sensitive attributes, by means of auditing
user access patterns.
Keywords: Intrusion Detection, Fuzzy Clustering, User Access Pattern, Insider Attacks,
Dubiety Score
1. INTRODUCTION
Data protection from insider threats is essential to most organizations. Attacks from insiders
could be more damaging than those from outsiders, since in most cases insiders have full or
partial access to the data; therefore, traditional mechanisms for data protection, such as
authentication and access control, cannot be solely used to protect against insiders. Since recent
work has shown that insider attacks are accompanied by changes in the access patterns of users,
user access pattern mining [1] is a suitable approach for the detection of these attacks. It creates
profiles of the normal access patterns of users from past logs of users’ accesses. New accesses
are later checked against these profiles, and mismatches indicate potential attacks.
Access control [2] is a security technique that regulates who can view or use resources in a
computing environment. There are diverse access control systems that perform authorization,
identification, authentication and access approval. Intrusion Detection Systems [3] scrutinise
and unearth surreptitious activities perpetrated by malevolent users. IDSs work by either looking
for signatures of known attacks or for deviations from normal activity. Normally, IDSs undergo a
training phase with intrusion free data wherein they maintain a log of benign transactions.
Pattern matching [4] is then used to detect whether or not an action is malign. This is called
anomaly-based detection [5]. When errors are detected using their known “signatures” from
previous knowledge of the attack, it is called signature-based detection [6]. These malicious
actions once detected are then either blocked or probed depending upon the organisation’s
policy. However, an IDS needs to be dynamic, robust and quick. Different IDS architectures
function differently and have different measures of performance. Every organisation needs to
ensure that the IDS it uses satisfies its requirements.
Several anomaly detection (AD) techniques have been proposed to detect anomalous data accesses. Some rely on
the analysis of input queries based on the syntax. Although these approaches are
computationally efficient, they are unable to detect anomalies in scenarios like the following
one. Consider a clerk in an organization who issues queries to a relational database that
typically selects a few rows from specific tables. An access from this clerk that selects all or
most of the rows of these tables should be considered anomalous with respect to the daily
access pattern of the clerk. However, approaches based on syntax only are not able to classify
such access as anomalous. Thus, syntactic approaches have to be extended to take into account
the semantic features of queries such as the number of result rows. An important requirement
is that queries should be inspected before their execution in order to prevent malicious queries
from making changes to the database.
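For the clerk scenario above, a semantic check on result cardinality can be sketched as follows. This is an illustrative fragment, not part of the proposed system; the function name and the k-sigma threshold are our own assumptions.

```python
import statistics

def is_row_count_anomalous(history, result_rows, k=3.0):
    """Flag a query whose result cardinality deviates from the user's norm.

    history: past result-row counts for this user (assumed available from logs);
    k: number of standard deviations tolerated (an assumed default)."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history) or 1.0   # guard against zero variance
    return abs(result_rows - mean) > k * std
```

A clerk whose queries usually return a handful of rows would thus trip the check when a query suddenly returns thousands, even though the query is syntactically routine.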
From a technical perspective, the main purpose is to ensure the effective enforcement of
security regulations. Auditing is an important technique for examining whether user behaviours
in a system conform to security policies. Many methods audit database processing by
comparing a user’s SQL query expression against predefined patterns so as to detect an
anomaly. But a malicious query may be disguised as benign so as to evade such syntactic
detection. To overcome this shortcoming, the data-centric method further audits whether the
data a user query actually accessed involved any banned information. However, such an audit
concerns a concrete policy rather than an overall view of multiple security policies. It requires
explicit audit commands articulated by experienced professionals and much interactive
analysis. Since in practice an anomaly pattern cannot be articulated in advance, it is difficult
to detect such fraud with current audit methods.
Anomaly detection technology is used to identify abnormal behaviours that are statistical
outliers. Some probabilistic methods learn normal patterns, against which they detect
anomalies. But these methods assume that very few users deviate from the normal patterns; if
there are many anomalous users, the learned normal pattern is distorted. These works do
not examine user behaviour from either a historical or an incremental view, which may
overlook some malicious behaviour. Furthermore, if a group of people collude, it is
difficult to find them with current methods.
We tackle the insider threat problem using different approaches. We take into consideration
the fact that certain data elements are more critical to the database as compared with other data
elements. Thus, we pay special attention to the security of such critical data elements. We also
recognise the presence of data attributes in a system which can be manipulated to indirectly
influence the crucial data attributes, and we address threats to our critical data elements
through such attributes as well.
We also investigate a suspected user from the diachronic view by analysing his/her historical
behaviour. We store a measure denoting how suspicious a user has been: the greater this
measure, the greater the chance that the user’s query is malicious. This measure also addresses
gradually escalating malicious behaviour, since the historical statistics accumulate over time.
The main purpose of our project (QPAFCS) is to recognise user access patterns. Our Intrusion
Detection System (IDS) pays special attention to certain semantically critical data elements,
along with those elements which can be used to infer them. We present an innovative approach
to combine a user’s historic and present access pattern and hence classify the incoming
transaction as malicious/non-malicious. Using FCM, we partition the users into fuzzy clusters.
Each of these clusters contains a set of rules in their cluster profiles. In the detection phase,
new transactions are checked against rules in these clusters, and then a suitable action is taken
depending upon the nature of transaction. The main advantage of our IDS lies in its ability to
prevent inference attacks on Critical Data Elements.
The remainder of this work is organized as follows. Section 2 presents prior research related
to this work. Section 3 introduces the fuzzy clustering and belief update framework. Section 4
discusses the approach using examples. In Section 5, we discuss how to apply our method in a
practical system. Experimental evaluation is discussed in Section 6.
2. RELATED WORK
Numerous researchers are currently working in the field of Network Intrusion Detection
Systems, but only a few have proposed research work on Database IDSs. Several systems for
Intrusion Detection in operating systems and networks have been developed; however, they
are not adequate for protecting databases from intruders [11]. Database ID systems work at the
query level, transaction level and user (role) level. Bertino et al. described the challenges of
ensuring data confidentiality, integrity and availability and the need for database security,
wherein the need for database IDSs to tackle insider threats was discussed.
Panda et al. [19] proposed to employ a data mining approach for determining data dependencies
in the database system. The classification rules reflecting data dependencies are deduced
directly from the database log. These rules represent what data items probably need to be read
before an update operation and what data items are most likely to be written following this
update operation. Transactions that are not compliant to the data dependencies generated are
flagged as anomalous transactions.
Database IDSs include temporal analysis of queries and data dependencies among attributes,
queries and transactions. Lee et al. [28] proposed a temporal-analysis-based intrusion detection
method which incorporated time signatures and recorded the update gap of temporal attributes.
Any anomaly in the update pattern of an attribute was reported as an intrusion in the proposed
approach. The breakthrough introduction of association rule mining by Aggarwal et al. [22]
helped in finding data dependencies among data attributes, which was incorporated in the field
of intrusion detection in databases.
During the initial development of data dependency association rule mining, DEMIDS, a misuse
detection system for relational database systems, was proposed by Chung et al. [7]. Profiles
specifying user access patterns were derived from the audit log, and distance metrics were
further applied for recognizing data items; these were used together to represent the
expanse of users. But once the number of users of a single system becomes substantial,
maintaining profiles becomes a redundant procedure. Another flaw was that the system
assumed domain information about a given schema.
Hu et al. [16] presented a data mining-based intrusion detection system which used static
analysis of the database audit log to mine dependencies among attributes at the transaction
level and represented those dependencies as sets of reading and writing operations on each
data item. In another approach proposed by Hu et al., techniques of sequential pattern mining
were applied to the training log in order to identify frequent sequences at the transaction level.
This approach helped in identifying groups of malicious transactions which individually
complied with the user behavior. The approach was later improved by Hu et al. by clustering
legitimate user transactions into user tasks for the discovery of inter-transaction data dependencies.
The proposed method extends the approach by assigning weights to all the operations on data
attributes. Transactions which did not follow the data dependencies were marked as
malicious. The major disadvantage of user-assigned weights is that they are static and
unrelated to other data attributes. Kamra et al. [27] employed a clustering technique on an
RBAC model to form profiles based on attribute access which represented normal user
behavior. An alarm is raised when behavior anomalous to a role profile is observed.
Bezdek, Ehrlich and Full (1984) proposed the Fuzzy C-Means (FCM) algorithm. The basic idea
behind this approach is to express the similarity a data point shares with each of the clusters
through a function often referred to as the membership function. This measure of similarity
lies between zero and one, signifies the extent of similarity between the data point and the
cluster, and is termed the membership value. The main aim of this technique is to construct
fuzzy partitions of a given data set.
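As a concrete illustration of the membership-function idea, the following is a minimal Fuzzy C-Means sketch in Python. This is our own illustrative code, not the thesis implementation; the fuzzifier m = 2, the tolerance, and the random initialisation are assumed defaults.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means: returns cluster centers and membership matrix U.

    X: (n, d) data array; c: number of clusters; m: fuzzifier (> 1).
    U[i, j] in [0, 1] is the membership of point i in cluster j; rows sum to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # fuzzy-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1))              # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U
```

On well-separated data, points close to a center receive membership near 1 for that cluster and near 0 elsewhere, which is exactly the fuzzy-partition behaviour described above.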
Y. Yu et al. [29] illustrated a fuzzy logic-based anomaly intrusion detection system. A Naive
Bayes classifier is used to classify an input event as normal or anomalous. The classifier is
based on the independent frequency of each system call from a process under normal
conditions. The ratio of the probability of a sequence coming from a process to the probability
of it not coming from the process serves as the input of a fuzzy system for the classification.
A hybrid approach was described by Doroudian et al. [26] to identify intrusions at both the
transaction and inter-transaction levels. At the transaction level, a set of predefined expected
transactions was specified to the system, and a sequential rule mining algorithm was applied
at the inter-transaction level to find dependencies between the identified transactions. The
drawback of such a system is that sequences with frequencies lower than the threshold value
are neglected. Therefore, infrequent sequences were completely overlooked by the system,
irrespective of their importance, and as a result the True Positive Rate of the system falls.
The above drawback was overcome by Sohrabi et al. [20], who proposed a novel approach,
ODARDM, in which rules were formulated for lower-frequency item sets as well. These rules
were extracted using leverage as the rule value measure, which measured the interestingness
of data dependencies. As a result, the True Positive Rate increased while the False Positive
Rate decreased. In recent developments, Rao et al. [21] presented a query access detection
approach using Principal Component Analysis and Random Forest to reduce data
dimensionality and produce only relevant and uncorrelated data. As the dimensionality is
reduced, both system performance and the True Positive Rate increase.
In 2009, Majumdar et al. [15] proposed a comprehensive database intrusion detection system
that integrates different types of evidence using an extended Dempster-Shafer theory. Besides
combining evidence, they also incorporated learning in their system through the application of
prior knowledge and observed data on suspicious users. In 2016, Bertino et al. [14] tackled the
insider threat problem from a data-driven, systemic view. User actions are recorded as
historical log data in a system, and the evaluation investigates the data that users actually
process. From the horizontal view, users are grouped together according to their
responsibilities and a normal pattern is learned from the group behaviours. They also
investigate a suspected user from the diachronic view by comparing his/her historical
behaviours with the historical average of the same group.
Anomaly detection has been an important research problem in security analysis; therefore, the
development of methods that can detect malicious insider behavior with high accuracy and a
low false alarm rate is vital [10]. In this problem layout, McGough et al. [8] designed a system
to identify anomalous user behavior by comparing an individual user’s activities against their
own routine profile as well as against the organization’s rules. They applied two independent
approaches, machine learning and a statistical analyzer, to the data. The results from these two
parts were then combined to form a consensus, which was mapped to a risk score. Their system
showed high accuracy, low false positives and minimal effect on the existing computing and
network resources in terms of memory and CPU usage.
Bhattacharjee et al. proposed a graph-based method that can investigate user behavior from
two perspectives: (a) anomaly with reference to the normal activities of an individual user
observed over a prolonged period of time, and (b) the relationship between a user and
colleagues with similar roles/profiles. They utilized the CMU-CERT dataset in an unsupervised
manner. In their model, the Boykov–Kolmogorov algorithm was used and the results were
compared with different algorithms including Single Model One-Class SVM, Individual
Profile Analysis, k-User Clustering and Maximum Clique (MC). Their proposed model was
evaluated using the Area-Under-Curve (AUC) metric, which showed impressive improvement
compared to the other algorithms [9].
T. Rashid et al. noted that the parameter learning task in HMMs is to find, given an output
sequence or a set of such sequences, the best set of state transition and emission probabilities.
The task is usually to derive the maximum likelihood estimate of the parameters of the HMM
given the set of output sequences. No tractable algorithm is known for solving this problem
exactly, but a local maximum likelihood can be derived efficiently using the Baum–Welch
algorithm or the Baldi–Chauvin algorithm. The Baum–Welch algorithm is a special case of the
expectation-maximization algorithm. If HMMs are used for time series prediction, more
sophisticated Bayesian inference methods, like Markov chain Monte Carlo (MCMC) sampling,
have proven to be favorable over finding a single maximum likelihood model, both in terms of
accuracy and stability [12]. Log data are high-dimensional and contain irrelevant and
redundant features; feature selection methods can be applied to reduce dimensionality,
decrease training time and enhance learning performance.
3. OUR APPROACH
3.1 Basic Notations
Large organisations deal with tremendous amounts of data whose security is of prime interest.
The data in databases comprise attributes describing real-life objects called entities. The
attributes have varying levels of sensitivity, i.e. not all attributes are equally important to the
integrity of the database. As an example, signatures and other biometric data are highly
sensitive attributes for a financial organisation like a bank, in comparison to others like name,
gender, etc. So, unauthorised access to the crucial attributes is of greater concern. Only
certain employees may have access to such data elements, and access by all others must be
blocked instantaneously to ensure the confidentiality and consistency of the data.
Our proposed model QPAFCS (Query Pattern Access and Fuzzy Clustering System) pays
special attention to sensitive data attributes, which are referred to as CDEs (Critical Data
Elements) in the text. The attributes that can be used to indirectly infer CDEs are also
critical to the functioning of the organisation. For instance, the account number of a user may
be used to access the signatures and other crucial details about him. Such attributes are
referred to as DAEs (Directly Associated Elements) in the text.
We propose a two-phase detection and prevention model that clusters users based on similarity
of their attribute access patterns and the types of queries performed by them, i.e. our model
tries to track the user access pattern of each user and further classify it as normal or malicious.
The superiority of our model lies in its ability to prevent unauthorised retrieval and
modification of the most sensitive data elements (CDEs). Our model also makes sure that the
query pattern for access to CDEs is specific and fixed for a particular user to avoid data
breaches, i.e. the user associates himself with his regular access behaviour. Any deviation from
the regular arrangement may lead to a depreciation of the user’s confidence score and may be
representative of the user’s malicious intent. The following terminologies are used:
Definition 1 (Transaction) A transaction is a set of queries executed by a user. Each transaction
is represented by a unique transaction ID and also carries the user’s ID; hence <Uid, Tid> acts
as a unique identification key for each set of query patterns. Each transaction T is denoted as
<Uid, Tid, <q1, q2, … qn>>
where
qi denotes the ith query, i ∈ [1 … n]
For example, suppose a user has id 1001. He/she then executes the following set of SQL
queries:
q1: SELECT a,b,c
FROM R1,R2
WHERE R1.A>R2.B
q2: SELECT P
FROM R5
WHERE R5.P==10
Then this is said to be a transaction of the form:
t=<1001,67,<q1,q2>>
Definition 2 (Query) A query is a standard database management system token/request for
inserting and retrieving data or information from a database table or combination of tables. We
define query as a read or write request on an attribute of the relation. A query is represented as
<O(D1), O(D2), … O(Dn)>
where,
D1, D2, … Dn ∈ Rs
where Rs is the relation schema and the Di are its attributes. O represents an operation
(read or write), O ∈ {R, W}.
For example, examine the following transaction:
start transaction
select balance from Account where Account_Number='9001';
select balance from Account where Account_Number='9002';
update Account set balance=balance-900 where Account_Number='9001' ;
update Account set balance=balance+900 where Account_Number='9002' ;
commit; //if all SQL queries succeed
rollback; //if any of SQL queries failed or error
The query corresponding to this transaction is:
<<R(Account_Number),R(balance)>, <R(Account_Number),R(balance)>,
<R(Account_Number),R(balance),W(balance)>,
<R(Account_Number),R(balance),W(balance)>>
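A toy mapping from SQL statements to operation sequences, in the spirit of the example above, can be sketched in Python. This is our own naive regex sketch, not the QPAFCS parser: it handles only single-table SELECT and UPDATE, and emits WHERE attributes before the selected or written ones, matching the worked example.

```python
import re

def parse_statement(sql):
    """Map one simple SQL statement to a read/write operation sequence.

    Returns a list of (op, attribute) pairs, 'R' for read and 'W' for write.
    Only single-table SELECT and UPDATE are handled; attributes read in the
    WHERE clause come first, as in the Definition 2 example."""
    sql = sql.strip().rstrip(';')
    ident = r"[A-Za-z_]\w*"
    sel = re.match(r'select\s+(.+?)\s+from\s+\w+\s*(?:where\s+(.+))?$', sql, re.I)
    if sel:
        ops = [('R', a) for a in re.findall(ident, sel.group(2) or '')]
        ops += [('R', a.strip()) for a in sel.group(1).split(',')]
        return ops
    upd = re.match(r'update\s+\w+\s+set\s+(\w+)\s*=\s*(.+?)\s*(?:where\s+(.+))?$',
                   sql, re.I)
    if upd:
        target, rhs, where = upd.group(1), upd.group(2), upd.group(3) or ''
        reads = re.findall(ident, where) + re.findall(ident, rhs)
        ops = [('R', a) for a in dict.fromkeys(reads)]   # dedupe, keep order
        ops.append(('W', target))
        return ops
    return []
```

For the update statement in the example, this yields the sequence <R(Account_Number), R(balance), W(balance)>, as in the query shown above.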
Definition 3 (Read Sequence) A read sequence is defined as
{R(x1), R(x2), … O(xn)}
where O represents an operation (read or write), O ∈ {R, W}. The read sequence represents
that the transaction may need to read all data items x1, x2, …, xn-1 before it performs the
operation (O ∈ {R, W}) on data item xn.
For example, consider the following update statement in a transaction.
Update Table1 set x = a + b + c where d = 90;
In this statement, before updating x, values of a, b, c and d must
be read and then the new value of x is calculated. So <R(a), R(b),
R(c), R(d), W(x)> ∈ RS(x).
Definition 4 (Write Sequence) A write sequence is defined as
{O(x1), W(x2), … W(xn)}
where O represents an operation (read or write), O ∈ {R, W}. The write sequence represents
that the transaction may need to write the data items x2, …, xn in this order after it operates
on data item x1.
For example, consider the following update statements in one transaction.
Update Table1 set x = a + b + c where a=50;
Update Table1 set y = x + u where x=60;
Update Table1 set z = x + w + v where w=80;
Using the above example, it can be noted that <W(x), W(y),W(z)>
is one write sequence of data item x, that is <W(x), W(y),W(z)> ∈
WS(x), where WS(x) denotes the write sequence set of x.
Definition 5 (Read Rules (RR)) Read rules are the association rules generated from read
sequences whose confidence is greater than the user-defined threshold (Ψconf). A read rule is
represented as
{R(x1), R(x2), …} ⇒ O(xn).
For all sequential patterns <R(x1), R(x2), …, R(xn-1), O(xn)> in the read sequence set, generate
the read rules with the format {R(x1), R(x2), …} ⇒ O(xn). If the confidence of the rule is larger
than the minimum confidence (Ψconf), it is added to the answer set of read rules, which
implies that before operating on xn, we need to read x1, x2, …, xn-1.
For example:
The Read Rule corresponding to the read sequence <R(a), R(b),
R(c), R(d), W(x)> is:
{R(a), R(b), R(c), R(d)} ⇒ W(x)
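The confidence test in Definition 5 can be sketched as follows. This is illustrative only; the counting style (one hit per sequence, with the antecedent required to precede the consequent) is our assumption, as the text does not fix one.

```python
def rule_confidence(sequences, antecedent, consequent):
    """Confidence of the rule {antecedent} => consequent over operation sequences.

    A sequence supports the antecedent if it contains every antecedent operation;
    it supports the whole rule if, in addition, the consequent operation appears
    after all of them. Confidence = rule support / antecedent support."""
    ante_hits = rule_hits = 0
    for seq in sequences:
        if all(op in seq for op in antecedent):
            ante_hits += 1
            if consequent in seq and all(
                    seq.index(op) < seq.index(consequent) for op in antecedent):
                rule_hits += 1
    return rule_hits / ante_hits if ante_hits else 0.0
```

A rule whose confidence exceeds the threshold Ψconf would then be kept in the rule set; for instance, with Ψconf = 0.6, a rule of confidence 2/3 survives.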
Definition 6 (Write Rules (WR)) Write rules are the association rules generated from write
sequences whose confidence is greater than the user-defined threshold (Ψconf). A write rule is
represented as
O(x) ⇒ {W(x1), W(x2), …}
For all sequential patterns <O(x), W(x1), W(x2), …, W(xk)> in the write sequence set, generate
the write rules with the format O(x) ⇒ {W(x1), W(x2), …, W(xk)}. If the confidence of the rule
is larger than the minimum confidence (Ψconf), it is added to the set of write rules, which
indicates that after updating x, data items x1, x2, …, xk must be updated by the same
transaction.
For Example: The write rule corresponding to the write sequence
<W(x), W(y), W(z)> is W(x) ⇒ {W(y), W(z)}
Definition 7 (Critical Data Elements (CDE)) They are semantically defined data elements
crucial to the functioning of the system. They are the data attributes of prime significance
having direct correlation to the integrity of the system. In a vertically hierarchical organisation,
these are the attributes accessed only by the top level management, and the access by lower
levels of hierarchy is strictly protected.
Type of Attribute Sensitivity Level
Critical data Elements Highest
Directly Associated Elements Medium
Normal Attributes Low
Table 3.1 Types of attributes and their sensitivity levels
CDEs are tokens of behaviour that our model uses for recognising malicious activity by users
of the system.
Definition 8 (Critical Rules (CR)) A set of rules that contain a Critical Data Element in their
antecedent or consequent:
CR = {ζ | (ζ ∈ RR ∨ ζ ∈ WR) ∧ x ∈ CDE ∧ (ζ: {R(x1), R(x2), …} ⇒ O(x) ∨ ζ: O(x) ⇒
{W(x1), W(x2), …})}
We propose a method of user access pattern recognition using the Critical Rules. The CRs
recognise the actions and goals of users from a series of observations of the users’ actions and
the environmental conditions, i.e. the user query pattern associated with the Critical Data
Elements.
Definition 9 (Directly Associated Elements (DAE)) The attributes, excluding those in CDE,
which are part of the antecedents or consequents of Critical Rules:
DAE = {μi | μi ∈ CR ∧ μi ∉ CDE}.
The query patterns as perceived by our model QPAFCS are explored using DAEs, which
represent the first level of access to the CDEs. A user’s behaviour is represented by a set of
first-order statements (derived from queries) called an attribute hierarchy, encoded in
first-order logic, which defines abstraction, decomposition and functional relationships
between types of access arrangements. The unit transactions accessing CDEs are decomposed
into an attribute hierarchy comprising DAEs, which further represents the user’s most sensitive
retrieval pattern.
Example:
R(b) → R(a)
R(b), R(c) → R(a)
If a is a CDE, then the set {b,c} represents DAEs.
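Definitions 8 and 9 can be sketched together: given mined rules as (antecedent, consequent) pairs over attribute names and a CDE set, the critical rules and the DAEs follow directly. The function name and rule encoding below are our own illustrative choices.

```python
def derive_critical(rules, cde):
    """Split mined rules into critical rules (Definition 8) and DAEs (Definition 9).

    rules: list of (antecedent_attrs, consequent_attr) pairs over attribute names;
    cde: set of Critical Data Elements. A rule is critical if it mentions a CDE
    anywhere; DAEs are the non-CDE attributes of the critical rules."""
    critical_rules = [(ante, cons) for ante, cons in rules
                      if cde & (set(ante) | {cons})]
    dae = set()
    for ante, cons in critical_rules:
        dae |= (set(ante) | {cons}) - cde
    return critical_rules, dae
```

For the rules R(b) → R(a) and R(b), R(c) → R(a) with CDE = {a}, this yields DAE = {b, c}, as in the example.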
Definition 10 (Dubiety Score (φ)) A measure of the anomaly exhibited by a user in the past,
based on his historic transactional data. This score summarizes the user’s historic malicious
access attempts and attempts to quantify the personnel vulnerability that the organisation faces
because of a particular user.
Dubiety Score is indicative of the amount of deviation between the user’s access pattern and
his designated role. The Dubiety Score, combined with the deviation of the user’s present query
from his normal behaviour pattern, yields the output of the proposed IDS.
For our model:
0 ≤ φ ≤ 1. (1)
The higher the Dubiety Score, the greater the evidence against the user following his assigned
role, i.e. the greater the suspected malicious intent (rogue behaviour).
Definition 11 (Dubiety Table) A table maintaining the record of dubiety scores of each user.
It contains two attributes: UserID and Dubiety Score.
The initial Dubiety scores are set to 1.
Uid φ
1001 1
1002 1
1003 1
1004 1
1005 1
Table 3.2 Initial Dubiety Table
The dubiety table is updated each time a user performs a query.
For example, let user 1001's deviation from the normal query pattern be quantified as 0.81.
The updated dubiety table is then as shown below, where
ds = deviation from the normal query, and
φi = initial dubiety score.
Uid φf = √(ds · φi)
1001 0.9
1002 1
1003 1
1004 1
1005 1
Table 3.3 Updated Dubiety Table
The updated dubiety table is then stored in memory for further processing.
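The table update above can be sketched in a few lines of Python. The helper name `update_dubiety` and the dictionary representation of the dubiety table are ours, for illustration only; the model itself only specifies the combination rule φf = √(ds · φi).

```python
import math

def update_dubiety(table, uid, ds):
    """Combine the deviation ds of the latest query with the stored
    dubiety score as phi_f = sqrt(ds * phi_i), and store it back."""
    phi_i = table[uid]
    phi_f = math.sqrt(ds * phi_i)
    table[uid] = phi_f
    return phi_f

# initial dubiety table (Table 3.2): every score starts at 1
dubiety = {1001: 1.0, 1002: 1.0, 1003: 1.0, 1004: 1.0, 1005: 1.0}
update_dubiety(dubiety, 1001, 0.81)   # sqrt(0.81 * 1.0) = 0.9, as in Table 3.3
```

Only user 1001's entry changes; the other users keep their initial score of 1.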
3.2 Learning Phase
We start our learning phase by reading the training dataset into memory and extracting
useful patterns from it. Our system requires a non-malicious training dataset composed of
transactions executed by trusted users. The model generates user profiles from the
transaction logs and quantifies deviation from normal behaviour, i.e., this phase aims to
recognise and characterise user activity patterns on the basis of their query arrangements.
The following are various components of architecture of the proposed model:
Fig 3(a) Learning Phase Architecture
COMPONENTS OF ARCHITECTURE:
Training data: A transaction log is a sequential record of all changes made to the database
while the actual data is contained in a separate file. The transaction log contains enough
information to undo all changes made to the data file as part of any individual transaction. The
log records the start of a transaction, all the changes considered to be a part of it, and then the
final commit or rollback of the transaction. Each database has at least one physical transaction
log and one data file that is exclusive to the database for which it was created. Our initial input
to the learning phase algorithm is the transaction log, with only authorised and consistent
transactions. This data is free of any unauthorised activity and is used to form user profiles,
role profiles etc based on normal user transactions. The logs are scanned, and the following
elements are extracted:
a. SQL Queries
b. The user executing a given query
SQL query parser: This is a tool that takes SQL queries as input, parses them and produces
sequences (read and write) corresponding to the SQL query as output. The query parser also
assigns a unique Transaction ID. The final output consists of three columns: TID (Transaction
ID), UID (User ID) and the read/write sequence generated by the parsing algorithm.
As an Example, if the following transaction performed by user U1001 is examined:
start transaction
select balance from Account where Account_Number='9001';
commit; //if all SQL queries succeed
rollback; //if any of SQL queries failed or error
The parser generates a unique Transaction ID, say T1234, and then parses the transaction,
finally yielding:
< T1234,U1001,<R(Account_number),R(balance)>>
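For illustration, a minimal parser sketch that handles only a simple SELECT of this shape is given below. This is not the actual query parser of QPAFCS (which would need a full SQL grammar); the function name and the regex are our own simplification, and it only shows how the read sequence above can be produced.

```python
import re

def parse_select(sql):
    """Toy sketch: extract read attributes from a simple
    'select <cols> from <table> [where <attr>=...]' statement and
    emit the <R(attr), ...> sequence in the order the attributes
    are touched (WHERE attribute first, then projected columns)."""
    m = re.match(r"select\s+(.+?)\s+from\s+\w+(?:\s+where\s+(\w+))?",
                 sql.strip(), re.I)
    if not m:
        return []
    seq = []
    if m.group(2):                        # attribute tested in WHERE
        seq.append(f"R({m.group(2)})")
    for col in m.group(1).split(","):     # then the selected columns
        seq.append(f"R({col.strip()})")
    return seq

seq = parse_select("select balance from Account where Account_Number='9001'")
# seq == ["R(Account_Number)", "R(balance)"], matching the sequence above
```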
Frequent sequences generator: After the SQL query parser generates the sequences, they
are pre-processed: weights are assigned to the data items, for instance CDEs are given greater
weight than DAEs and other normal attributes. These pre-processed sequences are then given
as input to the frequent sequences generator, which uses the PrefixSpan algorithm to generate
frequent sequences from the input sequences corresponding to each UID.
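A bare-bones PrefixSpan sketch is shown below, for sequences of single operations with support counted once per input sequence. This is our simplification for illustration, not the exact generator used by QPAFCS (which also carries the item weights mentioned above).

```python
def prefixspan(db, minsup):
    """Return all subsequences (in order, not necessarily contiguous)
    whose support, i.e. the number of input sequences containing them,
    is at least minsup."""
    patterns = []

    def mine(prefix, projected):
        counts = {}
        for seq in projected:             # count each item once per sequence
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in sorted(counts.items()):
            if sup >= minsup:
                patterns.append((prefix + [item], sup))
                # project every sequence past its first occurrence of item
                mine(prefix + [item],
                     [s[s.index(item) + 1:] for s in projected if item in s])

    mine([], db)
    return patterns

db = [["R(m)", "R(n)", "W(a)"],
      ["R(m)", "W(a)"],
      ["R(n)", "W(a)"]]
freq = prefixspan(db, minsup=2)
# contains e.g. (["R(m)", "W(a)"], 2) and (["W(a)"], 3)
```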
Rule generator: The frequent sequences are given as inputs to the rule generator module
which uses association rule mining to generate read rules and write rules out of the frequent
sequences.
As an example, if the input frequent sequences are:
1. <R(m),R(n),R(o),W(a)>
2. <R(m),R(n),W(o),W(a)>
3. <R(m),W(n),W(o),W(a)>
4. <W(a),R(b),W(o)>
5. <R(a),R(b),R(m),W(a)>
6. <R(a),R(b),W(m),W(b)>
S.No. Frequent Sequences Associated Rules
1 <R(m),R(n),R(o),W(a)> R(m),R(n),R(o) →W(a)
2 <R(m),R(n),W(o),W(a)> R(m),R(n),W(o) →W(a)
3 <R(m),W(n),W(o),W(a)> R(m),W(n),W(o) →W(a)
4 <W(a),R(b),W(o)> W(a),R(b) →W(o)
5 <R(a),R(b),R(m),W(a)> R(a),R(b),R(m) →W(a)
6 <R(a),R(b),W(m),W(b)> R(a),R(b),W(m) →W(b)
Table 3.4 Rule Generator for given Example
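The mapping in Table 3.4 takes the final operation of each frequent sequence as the consequent and the preceding operations as the antecedent. A one-line sketch (the helper name is ours; the arrow is written in ASCII):

```python
def sequence_to_rule(seq):
    """Turn a frequent sequence into an association rule whose
    consequent is the final operation, mirroring Table 3.4."""
    antecedent, consequent = seq[:-1], seq[-1]
    return f"{','.join(antecedent)} -> {consequent}"

rule = sequence_to_rule(["R(m)", "R(n)", "R(o)", "W(a)"])
# "R(m),R(n),R(o) -> W(a)", row 1 of Table 3.4
```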
DAE generator: In our approach, we semantically define a class of data items known as
Critical Data Elements (CDEs). These CDEs and the mined rules are given as input to our DAE
(Directly Associated Element) generator, which marks as a DAE every element that is present
in either the antecedent or the consequent of a rule involving at least one CDE.
User vector generator: Using the frequent sequences for the given audit period, this module
generates the user vectors. A user vector is of the form
BID = <UID, w1, w2, w3, ..., wn>
where wi = |O(ai)|, the total number of times the user with the given UID performs operation
O ∈ {R, W} on attribute ai in the pre-decided audit period. An audit period τ refers to a period
of time such as one year, a time window τ = [t1, t2], or the most recent 10 months. The user
vector is representative of the user's activity.
Each wi represents how frequently the user performs the operation on the particular data item.
The vector can also be used in a normalized form, as in our proposed model QPAFCS:
UVID = <UID, <p(a1), p(a2), p(a3), ..., p(an)>>
where
p(ak) = wk / Σ_{wj ∈ Bi} wj
p(ak) is defined as the probability of accessing the attribute ak. A value of p(ak) close to 1
means that the user accesses the given attribute frequently.
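The normalization step is a single division by the row sum, sketched below (the helper name is ours):

```python
def normalize_user_vector(weights):
    """Convert raw access counts w_k into access probabilities
    p(a_k) = w_k / sum_j w_j."""
    total = sum(weights)
    return [w / total for w in weights]

p = normalize_user_vector([7, 1, 2])   # [0.7, 0.1, 0.2]
```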
Cluster generator: It takes user vectors and rules as input and generates fuzzy clusters.
Users are clustered into different fuzzy clusters based on the similarity of their user vectors[29].
A cluster profile would include
Ci = <CID, {R}>
where, CID represents the cluster centroid, and
{R} is a set of rules which is formed by taking the union of all the rules that the members of
the given fuzzy cluster abide by.
We have used fuzzy c-means[26] clustering to create the clusters. Each user belongs to a
cluster to a certain degree wij, where wij represents the membership coefficient of the ith user
(ui) with the jth cluster. The centre of a cluster (α) is the mean of all points, weighted by their
membership coefficients[28]. Mathematically,
wij = 1 / Σ_{k=1}^{C} ( ‖ui − αj‖ / ‖ui − αk‖ )^(2/(m−1))

αk = Σ_u wk(u)^m · u / Σ_u wk(u)^m

Algorithm 1: DAE Generator
Data: CDE; set DAE = {}; RR = set of read rules; WR = set of write rules
Result: the set of directly associated elements DAE
Function DAEGenerator(CDE, RR, WR)
  for Ω ∈ RR ∪ WR do
    for α ∈ Ω do
      if α ∈ CDE then
        for β ∈ Ω, β ∉ CDE do
          DAE ← DAE ∪ {β}
        end
      end
    end
  end
The objective function that is minimized to create clusters is defined as:
arg min Σ_{i=1}^{n} Σ_{j=1}^{C} wij^m · ‖ui − αj‖²
where
n is the total number of users,
C is the number of clusters, and
m is the fuzzifier.
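The two alternating updates above can be sketched with numpy as follows. For brevity this sketch uses the plain Euclidean distance, whereas the model itself uses the modified Jensen-Shannon distance defined next; the function name and the fixed initial membership matrix are ours.

```python
import numpy as np

def fuzzy_c_means(X, W, m=2.0, iters=50):
    """Bare-bones fuzzy c-means: alternate the centroid update
    alpha_k = sum_u w_k(u)^m * u / sum_u w_k(u)^m and the membership
    update w_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    for _ in range(iters):
        centroids = (W**m).T @ X / (W**m).sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)   # rows sum to 1
    return W, centroids

# two obvious groups of user vectors in a 2-attribute space, C = 2 clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W0 = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
W, centroids = fuzzy_c_means(X, W0)
```

After a few iterations the first two users belong almost entirely to one cluster and the last two to the other, while every membership row still sums to 1.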
The dissimilarity/distance function used in the formation of the fuzzy clusters is the modified
Jensen-Shannon distance[27]. Given two user vectors[13]
UVx = <Ux, <px(a1), px(a2), px(a3), ..., px(an)>> and
UVy = <Uy, <py(a1), py(a2), py(a3), ..., py(an)>>
of equal length n, the modified Jensen-Shannon distance is computed as
D(UVx ‖ UVy) = (1/2) Σ_{i=1}^{n} [ (1 + px(ai)·w(ai)) · log2( (1 + px(ai)·w(ai)) / (1 + py(ai)·w(ai)) )
+ (1 + py(ai)·w(ai)) · log2( (1 + py(ai)·w(ai)) / (1 + px(ai)·w(ai)) ) ]
where w(ai) is the semantic weight associated with the ith attribute ai.
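A direct transcription of this formula (the function name is ours; `px`, `py` are the normalized user vectors and `w` the semantic weights):

```python
import math

def modified_js_distance(px, py, w):
    """Modified Jensen-Shannon distance between two normalized user
    vectors px, py with semantic attribute weights w, per the formula
    above. Zero iff the weighted vectors coincide."""
    total = 0.0
    for pxi, pyi, wi in zip(px, py, w):
        a = 1 + pxi * wi
        b = 1 + pyi * wi
        total += a * math.log2(a / b) + b * math.log2(b / a)
    return total / 2

d_same = modified_js_distance([0.5, 0.5], [0.5, 0.5], [1.0, 1.0])  # 0.0
d_diff = modified_js_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])  # > 0
```

Note that the measure is symmetric in px and py, one of the reasons the Jensen-Shannon family is preferred over the raw Kullback-Leibler divergence (see section 6.3).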
User profile generator: This module takes user vectors and the cluster profiles as input and
generates user profiles. A user profile is of the form
Ui = <UID, <p(a1), p(a2), p(a3), ..., p(an)>, <c1, c2, ..., cC>>
where
UID is a unique ID given to each user,
<p(a1), p(a2), p(a3), ..., p(an)> is a vector containing the probabilities of the user accessing
the particular attributes, and
<c1, c2, ..., cC> is a vector representing the membership coefficients of the given user for the
C different clusters.
As an example, consider a system with 4 fuzzy clusters and 4 attributes; Table 3.5 below
illustrates the profile of user U1001.
3.3 Testing Phase
In section 3.2, the learning phase was described, in which the system is trained using non-
malicious (benign) transactions. The trained model can now be used to detect malicious
transactions. In this phase, a test query is obtained as input and compared with the model's
perception of the user's access pattern, and the model evaluates whether the test transaction
is malicious. It is first checked whether the user is trying to access a CDE. If so, the transaction
is allowed only if the given user has accessed that CDE before. Next, it is checked whether any
DAE is being accessed. A user can perform a write operation on a DAE iff it has previously
been written by the same user; otherwise the transaction is termed malicious. Finally, we check
whether the transaction abides by the rules that are generally followed by similar users.
COMPONENTS OF THE TESTING PHASE:
Rule generator: This module takes the sequence as generated by the SQL query parser and
gives the rule that the input transaction follows. This can be a read rule or a write rule and
indicates the operations done by the user, data attributes accessed by the user and the order in
which they are accessed. Now this rule can be checked for maliciousness.
CDE Detector: The semantically critical elements referred to in our approach as CDEs are
detected in this module. The read/ write rule corresponding to the incoming transaction is
checked for the presence of CDEs. If the rule being checked for maliciousness contains a CDE,
then it is dealt with using the following policy:-
a. If read operation has been performed on any CDE, i.e. r(CDE) is present in the rule and
UV[i][r(CDE)] = 0 and UV[i][w(CDE)] = 0 for the given user, then the transaction is
termed as malicious.
b. If write operation has been performed on any CDE i.e. w(CDE) is encountered and
UV[i][w(CDE)] = 0 for the given user, then the transaction is termed as malicious.
Table 3.5 data for user U1001:
Inputs (cluster memberships): C1 = 0.2, C2 = 0.2, C3 = 0.2, C4 = 0.4
Output user vector: <U1001, 0.2, 0.1, 0.9, 0.6>
Output user profile: <U1001, <0.2, 0.1, 0.9, 0.6>, <0.2, 0.2, 0.2, 0.4>>
Fig 3(b) Architecture of Testing Phase.
DAE Detector: This module addresses the issue of inference attacks on CDEs. As discussed
earlier, certain data elements can be used to access the CDEs, i.e., first-order inference. This
module uses the rules mined in the learning phase to determine which elements (the DAEs)
can be used to directly infer the CDEs.
Our system seeks to prevent inference attacks by specially monitoring the DAEs, with
emphasis on write operations. If a write operation has been performed on any DAE, i.e.,
w(DAE) is present in the rule to be checked, and UV[i][w(DAE)] = 0 for the given user, then
the transaction is termed malicious.
Dubiety Score Calculator and Analyser: If the transaction has not been found malicious by
the previous two modules, we check whether it is malicious based on the user's history and
the behaviour pattern of all similar users (via the modified Jensen-Shannon distance). To do so,
we maintain a record of the actions of all users by keeping the Dubiety Score (φi).
Algorithm 2: CDE Detector
Data: set of rules (ϒ) from the test transaction; set χCDE; UID; user profile (ϴ)
Result: checks whether the test transaction is malicious or normal with respect to CDEs
for Ѓ ∈ ϒ do
  for ϱ ∈ Ѓ do
    if ϱ ∈ χCDE then
      if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then
        Raise Alarm;
      end
      if r(ϱ) ∈ Ѓ and ϴ[UID][r(ϱ)] == 0 and ϴ[UID][w(ϱ)] == 0 then
        Raise Alarm;
      end
    end
  end
end
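The CDE policy, together with the analogous DAE write check described next, can be sketched in one function. Here `profile` is a hypothetical per-user map of past read/write counts standing in for ϴ[UID]; names are ours.

```python
def is_malicious(rule_ops, cdes, daes, profile):
    """rule_ops: list of ("R"|"W", attr) pairs from the test transaction.
    CDE read: allowed only after a prior read or write of that CDE.
    CDE/DAE write: allowed only after a prior write by the same user."""
    for op, attr in rule_ops:
        if attr in cdes:
            if op == "W" and profile.get(("W", attr), 0) == 0:
                return True                       # unseen CDE write
            if op == "R" and profile.get(("R", attr), 0) == 0 \
                         and profile.get(("W", attr), 0) == 0:
                return True                       # unseen CDE read
        elif attr in daes and op == "W" and profile.get(("W", attr), 0) == 0:
            return True                           # unseen DAE write
    return False

profile = {("R", "cvv"): 3}        # this user has only ever read the CDE
flag = is_malicious([("W", "cvv")], cdes={"cvv"}, daes=set(), profile=profile)
# flag is True: a first-ever write to a CDE raises the alarm
```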
The deviation of a user's new transaction from his normal access pattern is referred to as
dubiety, and the relative measure of dubiety is the Dubiety Score. Our IDS keeps a log of the
DS (Dubiety Score) in a separate table. A user who is a potential threat tends to have a high
dubiety score. Another intuition that our system follows is that any transaction a user makes
matches significantly either with transactions the same user or similar users have made in
the past.
We use a measure ds to keep track of the maximum similarity of the given rule. We combine
ds with φi to get the final dubiety score φf for the given user. We define two thresholds ФLT
and ФUT: ФUT represents the upper limit for the dubiety score of a non-malicious user, whereas
ФLT denotes the lower limit. This means that if φf for a user comes out to be greater than ФUT,
the user is malicious; on the other hand, a φf value less than ФLT denotes a benign user.
If the incoming rule R1 is a write rule, then the consequent of the incoming rule is matched
with the corresponding rules of the clusters of which the user is a part. A user is said to be
part of the ith cluster iff μi > δ, where μi is the fuzzy membership coefficient of the given user
for the ith cluster and δ is a user-defined threshold.
If the incoming rule R1 is a read rule, then the antecedent of the incoming rule is matched
with the corresponding rules of the clusters of which the user is a part.
In order to quantitatively measure the similarity between two rules, we use the modified
Jaccard distance[30]:
JD(R1, R2) = 1 − [ δ1·|R1 ∩ R2| − δ2·(|R1 ∪ R2| − |R1 ∩ R2|) ] / |R1 ∪ R2|
where R2 ranges over the rules of every cluster i with μi > δ and i ∈ [1, k].
Algorithm 4: Modified Jaccard Distance
Data: rules R1, R2; thresholds δ1, δ2; sets χR1, χR2
Result: distance between the two rules (Ԏ)
Function jcDistance (R1, R2)
  for Ω ∈ R1 do
    χR1 ← χR1 ∪ {Ω};
  end
  for Ω' ∈ R2 do
    χR2 ← χR2 ∪ {Ω'};
  end
  Ԏ = 1 − [ δ1·|χR1 ∩ χR2| − δ2·(|χR1 ∪ χR2| − |χR1 ∩ χR2|) ] / |χR1 ∪ χR2|;
  return Ԏ;
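Treating each rule as a set of operations, the distance is a one-liner. The sketch below reproduces the worked example of section 5 (δ1 = 0.70, δ2 = 0.20, JD = 0.75):

```python
def jc_distance(r1, r2, d1=0.70, d2=0.20):
    """Modified Jaccard distance between two rules, each treated as a
    set of read/write operations."""
    inter = len(r1 & r2)
    union = len(r1 | r2)
    return 1 - (d1 * inter - d2 * (union - inter)) / union

r1 = {"R(c)", "R(b)", "R(a)"}      # R(c), R(b) -> R(a)
r2 = {"R(d)", "R(b)", "R(a)"}      # R(d), R(b) -> R(a)
jd = jc_distance(r1, r2)           # |inter| = 2, |union| = 4, jd ≈ 0.75
```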
Algorithm 3: DAE Detector
Data: set of rules (ϒ) from the test transaction; set χDAE; UID; user profile (ϴ)
Result: checks whether the test transaction is malicious or normal with respect to DAEs
for Ѓ ∈ ϒ do
  for ϱ ∈ Ѓ do
    if ϱ ∈ χDAE then
      if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then
        Raise Alarm;
      end
    end
  end
end
The minimum value of JD is regarded as ds, and φi is fetched directly from the dubiety table.
The final dubiety score for the given user is calculated as:
φf = √(ds · φi)
If φf < ФLT, the transaction is termed non-malicious. In this case, the current dubiety score in
the dubiety table for the given user is reduced by a factor known as the "amelioration
factor" (Å); thus φi is updated as φi = Å·φi.
If ФLT ≤ φf < ФUT, the transaction is termed non-malicious and the dubiety table entry for the
given user is updated with φf.
If φf ≥ ФUT, the transaction is termed malicious.
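The three-way threshold policy can be sketched as follows. The value Å = 0.99 is our assumption for illustration (the paper leaves Å open), chosen because it reproduces the 0.2 → 0.198 update of user 1003 in Table 3.9; the thresholds match the example values ФLT = 0.3 and ФUT = 0.6.

```python
import math

def evaluate_transaction(phi_i, ds, lt=0.3, ut=0.6, amelioration=0.99):
    """Apply the threshold policy; returns (verdict, updated phi)."""
    phi_f = math.sqrt(ds * phi_i)
    if phi_f < lt:
        return "non-malicious", amelioration * phi_i   # reward good behaviour
    if phi_f < ut:
        return "non-malicious", phi_f                  # store phi_f as-is
    return "malicious", phi_i                          # block; score unchanged

verdict, phi = evaluate_transaction(phi_i=0.2, ds=0.2)   # user 1003 of Table 3.9
# verdict == "non-malicious", phi ≈ 0.198
```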
As an example, let the initial dubiety table be:
Uid φ
1001 0.9
1002 0.8
1003 0.2
1004 0.6
1005 0.7
Table 3.6 Initial dubiety table
Let the minimum value of ds corresponding to each user be:
Uid ds
1001 0.2
1002 0.3
1003 0.2
1004 0.6
1005 0.3
Table 3.7 Minimum ds values for various Users
The calculated dubiety score table:
Uid φf = √(ds · φi)
1001 0.42
1002 0.49
1003 0.2
1004 0.6
1005 0.46
Table 3.8 calculated dubiety scores table
Taking ФLT = 0.3 and ФUT = 0.6:
Uid   φf    Nature of Transaction   Updated φf
1001  0.42  Non-malicious           0.42
1002  0.49  Non-malicious           0.49
1003  0.2   Non-malicious           0.198
1004  0.6   Malicious               0.6
1005  0.46  Non-malicious           0.46
Table 3.9 Summary of transactions of various users
Malicious transactions are blocked outright, while non-malicious transactions are processed.
The updated dubiety table is stored in the database.
4. DISCUSSION
With regard to a typical credit card company dataset, some examples of critical data elements
(CDEs) are: -
1. CVV (denoted by a)
Card verification value (CVV) is a combination of features used in credit, debit and
automated teller machine (ATM) cards for the purpose of establishing the owner's identity
and minimizing the risk of fraud. The CVV is also known as the card verification code
(CVC) or card security code (CSC).
When properly used, the CVV is highly effective against some forms of fraud. For example,
if the data in the magnetic stripe is changed, a stripe reader will indicate a "damaged card"
error. The flat-printed CVV is (or should be) routinely required for telephone or Internet-
based purchases because it implies that the person placing the order has physical possession
of the card. Some merchants check the flat-printed CVV even when transactions are
conducted in person.
CVV technology cannot protect against all forms of fraud. If a card is stolen or the legitimate
user is tricked into divulging vital account information to a fraudulent merchant,
unauthorized charges against the account can result. A common method of stealing credit
card data is phishing, in which a criminal sends out legitimate-looking email in an attempt to
gather personal and financial information from recipients. Once the criminal has possession
of the CVV in addition to personal data from a victim, widespread fraud against that victim,
including identity theft, can occur.
The following are directly associated elements (DAEs) to CVV:-
a. Credit card number (denoted by b)
b. Name of card holder (denoted by c)
c. Card expiry date (denoted by d)
Credit card number, name of the card holder and card expiry date are elements that are read
before the CVV and are used to validate the CVV entered by the user; hence these attributes
have been classified as DAEs by our system.
Some normal data attributes are: -
1. Gender of Customer (denoted by e)
2. Credit Limit (denoted by f)
3. Customer’s phone number (denoted by g)
These are the attributes that have been collected for the fraud detection and are not directly
used to access the CDE but are crucial for the process.
Some examples of transactions for our proposed approach:
R(b) → R(a)
R(b), R(c) → R(a)
5. EXAMPLES OF OUR APPROACH
1. JC Distance
R1: R(c), R(b) → R(a)
R2: R(d), R(b) → R(a)
The modified JC distance between R1 and R2, with hyperparameters δ1 = 0.70 and δ2 = 0.20,
is calculated as
JD = 1 − [ δ1·|R1 ∩ R2| − δ2·(|R1 ∪ R2| − |R1 ∩ R2|) ] / |R1 ∪ R2|
|R1 ∩ R2| = 2, |R1 ∪ R2| = 4
JD = 1 − (0.70·2 − 0.20·2)/4 = 0.75
2. User Profile Vector
B1 = <U1, <0.7, 0.1, 0.6, 0.2, 0.4, 0.0, 0.2, 0.0>, <0.2, 0.3, 0.1, 0.2, 0.167, 0.033> >
Here the values in the second tuple <0.7, …0.0> represent the probability of User U1
accessing particular attributes, for instance 0.7 denotes that there is a 70% probability that
U1 accesses the first attribute.
The values in the third tuple represent the memberships of user U1 in the k fuzzy clusters,
where k = 6 in our case.
3. Dubiety Score
Suppose the dubiety score φi for user U1 is 0.8, and the JC distance of the test transaction
with its cluster is ds = 0.6. Then
φf = √(ds · φi) = √(0.6 × 0.8) = √0.48 ≈ 0.69
Setting the hyperparameter ФUT to 0.65, we observe that φf > ФUT; hence the test transaction
is malicious and an alarm is raised.
6. EXPERIMENTATION
In this section, we describe the method of evaluation of the proposed algorithm. Firstly, we
describe our dataset. We then calculate various accuracy measures considering different
parameters as reference.
6.1 Description of dataset
This work concerns anomaly detection in user behaviour. Ideally, the dataset would be obtained
from a production system with concrete job functions; in practice, however, such data is highly
sensitive for almost every organization or company.
The performance of the algorithm was analysed by carrying out several experiments on
a credit card company dataset adhering to the TPC-C benchmark[18]. The TPC-C schema is
composed of a mixture of read only and read/write transactions that replicate the activities
found in complex OLTP application environment. The database schema, data population,
transactions, and implementation rules were designed to broadly represent modern OLTP
systems. We used two audit logs: one for training the model and a second for testing it. The
training log comprised normal user transactions, while the testing log consisted of a mixture of
normal and malicious user transactions. Although the real dataset contains some unusual
records, we also injected anomalies for detection; the injected anomalies deviate from the
normal behaviour pattern in several respects. In total, about 20,000 transactions were used,
of which about 99% were non-malicious and less than 1% malicious. Fig. 6(a) shows the
distribution of malicious and benign data in the dataset used:
Fig 6(a) Frequency of data items and their access frequency
The details of CDEs, DAEs and Normal data items has already been given in Section 3 and
examples have been discussed in Section 5.
The access pattern data shows that CDEs are rarely accessed, and only by a few user roles;
hence, protection of CDEs from malicious access is of greater significance than that of DAEs
and normal data elements.
6.2 Cluster Analysis
When the number of users or user roles exceeds a given limit, it becomes exceedingly difficult
for the IDS to keep track of individual user access patterns and hence detect anomalies;
clustering is therefore a computationally efficient way to improve the performance of the IDS.
We prefer fuzzy clustering over hard clustering. Fuzzy clustering (also referred to as soft
clustering) is a form of clustering in which each data point can belong to more than one cluster,
whereas in non-fuzzy (hard) clustering the data is divided into distinct clusters and each data
point belongs to exactly one cluster. Membership grades are assigned to each data point,
indicating the degree to which it belongs to each cluster; points on the edge of a cluster, with
lower membership grades, belong to it to a lesser degree than points in its centre. When we
evaluate the various performance measures with the number of clusters as the reference
parameter, we observe that a particular cluster count is the most efficient in predicting
results.
Fig 6(b) Variation of performance with number of clusters
Fig 6(b) depicts variation in precision, recall, TNR, accuracy with change in number of clusters.
From the graph, we can see that:-
a. TNR does not vary with the number of clusters, i.e., TNR is invariant.
b. The precision is always greater than 0.94 and is more or less constant.
c. Recall reaches its optimum value when the number of fuzzy clusters is greater than 3.
d. Accuracy also reaches its optimum value when the number of clusters is greater than 3.
6.3 Distances and thresholds
In section 3.2, we have described Modified Jensen-Shannon distance as a measure to calculate
distance between two user vectors of same length. In probability theory and statistics, the
Jensen–Shannon divergence is a method of measuring the similarity between two probability
distributions. It is also known as information radius (IRad) or total divergence to the average.
It is based on the Kullback–Leibler divergence, with some notable (and useful) differences,
including that it is symmetric and always finite. The square root of the Jensen-Shannon
divergence is a metric often referred to as the Jensen-Shannon distance. We preferred the
modified Jensen-Shannon distance so as to give weights to data attributes and avoid the curse
of dimensionality. The variation of the modified Jensen-Shannon distance with the Euclidean
distance is shown in fig 6(g).
In section 3.3, we have defined modified Jaccard distance to quantitatively measure the
similarity between two rules. The Jaccard index, also known as Intersection over Union or the
Jaccard similarity coefficient, is a statistical measure used for comparing the similarity and
diversity of sample sets. The Jaccard coefficient measures similarity between finite sample
sets and is defined as the size of the intersection divided by the size of the union of the sample
sets. The variation of the modified Jaccard index with the Jaccard index is shown in fig 6(h).
The variation of precision, recall, TNR, accuracy with the various thresholds, namely 𝛿1, 𝛿2,
фUT , фLT that were defined in section 3 is shown in the following figures:
Fig 6(c) shows the variation of Precision, recall, TNR, accuracy with 𝛿1. It can be observed
from the graph that Precision, TNR and Accuracy increase with the increase in value of 𝛿1,
while the value of Recall decreases with increase in value of 𝛿1.
Fig 6(e) shows the variation of Precision, recall, TNR, accuracy with 𝛿2. It can be observed
from the graph that the value of Precision, TNR and Accuracy starts decreasing when the value
of 𝛿2 increases beyond a certain value. Recall, on the other hand, increases for higher values
of 𝛿2.
Fig 6(d) shows the variation of Precision, recall, TNR, accuracy with фUT. It can be observed
from the graph that the value of Precision first decreases and then exponentially increases with
the increase in value of фUT. An identical trend is followed by Accuracy. Somewhat similar
trend is followed by TNR except that it does not decrease initially. On the contrary, the value
of Recall decreases with the increase in value of фUT.
Fig 6(f) shows the variation of Precision, recall, TNR, accuracy with фLT. It can be observed
from the graph that the values of all the parameters fluctuate a little but remain more or less
constant with the increase in value of фLT.
With regard to the dataset we have used, the following inferences can be drawn from the graphs:
1. Value of 𝛿1 should be close to 0.65 for optimum performance.
2. Value of 𝛿2 should be close to 0.55 for optimum performance.
3. Value of фUT should be close to 0.59 for optimum performance.
4. Value of фLT should be close to 0.2 for optimum performance.
6.4 Comparison with related methods
Table 6.1 shows the performance measures used for the comparison of approaches. Using
these performance measures, we compare our approaches with other related works. Our
approaches are:-
Approach 1. Our approach, using the modified Jensen-Shannon distance and the modified
Jaccard index.
Approach 2. Using the unmodified Jaccard index with the Jensen-Shannon distance.
Approach 3. Using the Euclidean distance with the unmodified Jaccard index.
From the table, the following observations can be made:-
If we compare Approach 1 with Approach 2, we can observe that:-
The TNR and precision of Approach 1 are a lot better than those of Approach 2.
Approach 1 also has better accuracy than Approach 2.
Approach 1 has much lower FPR and FDR scores than Approach 2.
Amongst the other performance measures, the MK and MCC values of Approach 1 are also
better than those of Approach 2.
Approach 2, on the other hand, has better TPR, NPV and FOR measures than Approach 1.
Both Approach 1 and Approach 2 have somewhat similar F1 scores.
In measures like FPR and TNR, where Approach 1 performs well, Approach 2 performs rather
poorly; however, in measures like TPR and NPV, where Approach 2 performs better, Approach
1 also performs well. For example, both approaches have similar NPV scores, with Approach 2
performing slightly better. As Approach 1 performs far better than Approach 2 in most of the
measures, we can conclude that the overall performance of Approach 1 is better than that of
Approach 2.
If we compare Approach 1 with Approach 3, we observe that:-
The TNR and precision of Approach 1 are a lot better than those of Approach 3.
Approach 1 also has better accuracy than Approach 3.
Approach 1 also has much lower FPR and FDR scores than Approach 3.
Amongst the other performance measures, the MK and MCC values of Approach 1 are also
slightly better than those of Approach 3.
Approach 3, on the other hand, has better TPR, NPV and FOR measures than Approach 1;
in fact, it has the best values for these parameters in the entire table.
Both Approach 1 and Approach 3 also have somewhat similar F1 scores.
In measures like TNR and precision, where Approach 1 has some of the best scores in the
entire table, Approach 3 performs rather poorly; Approach 3 also lags far behind in FPR and
FDR. On the other hand, in the measures in which Approach 3 performs better than Approach
1, Approach 1 still performs quite well; for example, in the case of NPV both approaches have
good scores, with Approach 3 performing better. Similar trends are observed for all other
measures except FNR, where Approach 3 is far superior. Considering the above, even though
Approach 3 has the best values for some performance measures, its poor performance on the
other measures is a clear disadvantage, due to which Approach 1 is better than Approach 3.
Table 6.3 shows a comparison of our approaches with various other related works.
Table 6.3 Comparison of our approaches with related works
If we compare our approach with other related approaches, we observe that:-
In comparison to HU Panda, our approach performs better with respect to all the performance
measures considered for the purpose of comparison.
In comparison to the work of Mostafa et al., our approach performs better with respect to all
the performance measures considered for comparison.
In comparison to the work of Hashemi et al., even though our approach scores slightly lower
in measures like TNR and precision, it scores a lot better with respect to the rest of the
performance measures.
In comparison to the work of Mina Sohrabi et al., our approach performs better with respect
to all the performance measures present in the table.
In comparison to the work of Majumdar et al., our approach performs better with respect to
all the performance measures we have considered for the purpose of comparison.
In comparison to the work of UP Rao et al., our approach performs better with respect to all
the measures considered in the table.
In comparison to the work of Elisa Bertino, our approach gives better TNR and precision
scores, as well as comparatively better FDR and FPR scores. In the other measures, except
TPR and recall, both approaches have somewhat similar scores. Since our work is mostly
related to finding critical data items in a dataset, higher TNR and precision scores are more
desirable than the other performance measures. Since our approach also performs quite well
with respect to the other performance measures, the better TNR and precision scores can
easily make up for the lower recall values.
Sensitivity
Measure   App.1  App.2  App.3  Panda  Hashemi  Mostafa  Sohrabi  Majumdar  Bertino  Rao
PPV       0.96   0.73   0.74   0.88   0.97     0.94     0.93     0.88      0.94     0.61
TPR       0.81   0.95   1.00   0.73   0.71     0.75     0.66     0.70      0.91     0.70
ACC       0.89   0.80   0.83   0.81   0.84     0.85     0.80     0.80      0.93     0.64
F1 Score  0.88   0.83   0.85   0.79   0.82     0.83     0.77     0.78      0.92     0.65
NPV       0.83   0.93   1.00   0.77   0.77     0.79     0.73     0.75      0.91     0.68
FDR       0.04   0.27   0.26   0.12   0.03     0.06     0.07     0.13      0.06     0.39
FOR       0.17   0.07   0.00   0.23   0.23     0.21     0.27     0.25      0.09     0.32
BM        0.77   0.60   0.65   0.63   0.69     0.70     0.60     0.60      0.85     0.35
FPR       0.03   0.34   0.34   0.10   0.02     0.05     0.05     0.10      0.06     0.45
TNR       0.96   0.65   0.65   0.90   0.98     0.95     0.94     0.90      0.94     0.65
FNR       0.19   0.05   0.00   0.28   0.29     0.25     0.35     0.30      0.09     0.30
MK        0.79   0.66   0.74   0.65   0.74     0.73     0.66     0.63      0.85     0.29
MCC       0.78   0.63   0.70   0.63   0.72     0.71     0.63     0.61      0.85     0.29
(App. = Approach; Panda = HU Panda; Hashemi = Hashemi et al.; Mostafa = Mostafa et al.;
Sohrabi = Mina Sohrabi et al.; Majumdar = Majumdar et al. (2006); Bertino = Elisa Bertino
et al.; Rao = UP Rao et al. (2016))
7. CONCLUSION AND FUTURE WORKS
In this paper we have tried to detect malicious transactions with a perspective that certain data
elements hold more critical information than others. Inference attacks against such data
elements are blocked by taking into account user access pattern and also the historic behaviour.
A user who regularly behaves as a normal user is gradually allowed to improve his dubiety
score. The approach is analysed with respect to different performance parameters by
conducting experiments. Finally, it may be concluded that the approach works efficiently in
determining the nature of a transaction.
We plan to extend our approach from 2-level inference control to n-level inference control,
whereby nth-order statements will be encoded into an attribute hierarchy; the n-level
attribute tree/graph will then be manipulated to form fuzzy clusters, and incoming transactions
will be checked against the nth access level. Automatic manipulation of semantics to classify
attributes as critical data elements may also be considered a future research topic.
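A minimal sketch of how such an n-level attribute hierarchy might be represented is given below. All names here (AttrNode, inference_levels) are hypothetical, introduced only for illustration of the proposed extension, not part of the implemented system:

```python
# Hypothetical sketch of an n-level attribute hierarchy for inference control.
# Class and function names are illustrative, not from the implemented system.

class AttrNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def inference_levels(root):
    """Map each attribute to its depth in the hierarchy, i.e. the inference
    level at which accessing it may indirectly expose its ancestors."""
    levels = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        levels[node.name] = depth
        for child in node.children:
            stack.append((child, depth + 1))
    return levels

# A 3-level toy hierarchy: salary can be inferred from grade, grade from title.
tree = AttrNode("salary", [AttrNode("grade", [AttrNode("title")])])
print(inference_levels(tree))  # → {'salary': 0, 'grade': 1, 'title': 2}
```

An incoming transaction could then be checked against the nth access level by comparing the levels of the attributes it touches, generalising the 2-level check used in this work.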
8. REFERENCES
1. I-Yuan Lin, Xin-Mao Huang and Ming-Syan Chen, "Capturing user access patterns in the
Web for data mining", Proceedings of the 11th International Conference on Tools with
Artificial Intelligence, IEEE, 9-11 Nov. 1999.
2. R. S. Sandhu and P. Samarati, "Access control: principles and practice", IEEE
Communications Magazine, Vol. 32, Issue 9, Sept. 1994.
3. Denning, D.E. (1987) An Intrusion Detection Model. IEEE Transactions on Software
Engineering, Vol. SE-13, 222-232.
4. Knuth, Donald E., James H. Morris, Jr, and Vaughan R. Pratt. "Fast pattern matching in
strings." SIAM journal on computing 6.2 (1977): 323-350.
5. Wang, Ke. “Anomalous Payload-Based Network Intrusion Detection”. Recent Advances
in Intrusion Detection. Springer Berlin. doi:10.1007/978-3-540-30143-1_11
6. Douligeris, Christos; Serpanos, Dimitrios N. (2007-02-09). Network Security: Current
Status and Future Directions. John Wiley & Sons. ISBN 9780470099735.
7. Christina Yip Chung, Michael Gertz and Karl Levitt (2000), “DEMIDS: a misuse detection
system for database systems”, Integrity and internal control information systems: strategic
views on the need for control, Kluwer Academic Publishers, Norwell, MA.
8. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief, et al.,
"Insider Threats: Identifying Anomalous Human Behaviour in Heterogeneous Systems
Using Beneficial Intelligent Software (Ben-ware)", presented at the Proceedings of the 7th
ACM CCS International Workshop on Managing Insider Security Threats, Denver,
Colorado, USA, 2015.
9. S. D. Bhattacharjee, J. Yuan, Z. Jiaqi, and Y.-P. Tan, "Context-aware graph-based analysis
for detecting anomalous activities", presented at the Multimedia and Expo (ICME), 2017
IEEE International Conference on, 2017.
10. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, "Automated insider threat detection
system using user and role-based profile assessment", IEEE Systems Journal, vol. 11, pp.
503-512, 2015.
11. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese, "Validating an Insider
Threat Detection System: A Real Scenario Perspective", presented at the 2016 IEEE
Security and Privacy Workshops (SPW), 2016.
12. T. Rashid, I. Agrafiotis, and J. R. C. Nurse, "A New Take on Detecting Insider Threats:
Exploring the Use of Hidden Markov Models", presented at the Proceedings of the 8th
ACM CCS International Workshop on Managing Insider Security Threats, Vienna, Austria,
2016.
13. Zamanian Z., Feizollah A., Anuar N.B., Kiah L.B.M., Srikanth K., Kumar S. (2019) User
Profiling in Anomaly Detection of Authorization Logs. In: Alfred R., Lim Y., Ibrahim A.,
Anthony P. (eds) Computational Science and Technology. Lecture Notes in Electrical
Engineering, vol 481. Springer, Singapore
14. Yuqing Sun, Haoran Xu, Elisa Bertino, and Chao Sun. 2016. A Data-Driven Evaluation for
Insider Threats. Data Science and Engineering Vol. 1, 2 (2016), 73--85.
doi:10.1007/s41019-016-0009-x
15. S. Panigrahi, S. Sural and A. K. Majumdar, "Detection of intrusive activity in databases by
combining multiple evidences and belief update," 2009 IEEE Symposium on
Computational Intelligence in Cyber Security, Nashville, TN, 2009, pp. 83-90. doi:
10.1109/CICYBS.2009.4925094
16. Yi Hu and Brajendra Panda, "A data mining approach for database intrusion detection",
SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 711-716.
doi: 10.1145/967900.968048
17. Abhinav Srivastava , Shamik Sural , A. K. Majumdar, Weighted intra-transactional rule
mining for database intrusion detection, Proceedings of the 10th Pacific-Asia conference
on Advances in Knowledge Discovery and Data Mining, April 09-12, 2006, Singapore
doi:10.1007/11731139_71
18. TPC-C benchmark: http://www.tpc.org/tpcc/default.asp
19. Mina Sohrabi, M. M. Javidi and S. Hashemi, "Detecting intrusion transactions in database
systems: a novel approach", Journal of Intelligent Information Systems, 42:619-644,
doi: 10.1007, Springer, 2014.
20. U. P. Rao et al., "Weighted Role Based Data Dependency Approach for Intrusion Detection
in Database", International Journal of Network Security, Vol. 19, No. 3, pp. 358-370, May
2017. doi: 10.6633/IJNS.201703.19(3).05
21. R. Agrawal, T. Imielinski and A. Swami, "Mining Association Rules between Sets of Items
in Large Databases", in Proceedings of the 1993 ACM SIGMOD International Conference
on Management of Data, 1993.
22. Sattar Hashemi, Ying Yang, Davoud Zabihzadeh and Mohammadreza Kangavari,
"Detecting intrusion transactions in databases using data item dependencies and anomaly
analysis", Expert Systems, 25(5):460-473, November 2008.
doi: 10.1111/j.1468-0394.2008.00467
23. Mostafa Doroudian, Hamid Reza Shahriari, “A Hybrid Approach for Database Intrusion
Detection at Transaction and Inter-transaction Levels”, 6th Conference on Information and
Knowledge Technology (IKT 2014), May 28-30, 2014, Shahrood University of
Technology, Tehran, Iran.
24. E. Bertino, A. Kamra, E. Terzi and A. Vakali (2005), "Intrusion detection in RBAC
administered databases", in Proceedings of the Applied Computer Security Applications
Conference (ACSAC).
25. Lee, V. C.S., Stankovic, J. A., Son, S. H. Intrusion Detection in Real-time Database
Systems Via Time Signatures. In Proceedings of the Sixth IEEE Real Time Technology
and Applications Symposium, 2000.
26. Weina Wang, Yunjie Zhang, Yi Li and Xiaona Zhang (2006), "The Global Fuzzy C-Means
Clustering Algorithm", 2006 6th World Congress on Intelligent Control and Automation,
Dalian, 2006, pp. 3604- 3607.
27. Bent Fuglede and Flemming Topsøe (2004), "Jensen-Shannon divergence and Hilbert space
embedding", Proceedings of the IEEE International Symposium on Information Theory, 2004.
28. Dunn, J. C. (1973), "A Fuzzy Relative of the ISODATA Process and Its Use in
Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3(3): 32-57.
doi: 10.1080/01969727308546046. ISSN 0022-0280.
29. A. Mangalampalli and V. Pudi (2009), "Fuzzy association rule mining algorithm for fast
and efficient performance on very large datasets", 2009 IEEE International Conference on
Fuzzy Systems, Jeju Island, 2009, pp. 1163-1168
30. Vorontsov, I.E., Kulakovskiy, I.V. & Makeev, V.J. Algorithms Mol Biol (2013) 8: 23.
“Jaccard index based similarity measure to compare transcription factor binding site
models” doi: 10.1186/1748-7188-8-23
LIST OF ABBREVIATIONS USED
1. CDE: Critical Data Elements
2. DAE: Directly Accessed Elements
3. IDS: Intrusion Detection System
4. SPM: Sequential Pattern Mining
5. FPM: Frequent Pattern Mining
6. ARM: Association Rule Mining
7. DT: Dubiety Table
8. FPR: False Positive Rate
9. TPR: True Positive Rate
10. FNR: False Negative Rate
11. TNR: True Negative Rate
12. UV: User Vector
13. Uid: User ID
14. Cid: Cluster ID
15. JCD: Jaccard Distance
16. JSD: Jensen-Shannon Distance
17. KL: Kullback-Leibler
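For reference, the three distance measures abbreviated above (JCD, JSD, KL) can be sketched in a few lines. This is a generic illustration of the standard definitions (with log base 2, so the JSD is bounded in [0, 1]), not an excerpt from our implementation:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence; assumes q[i] > 0 wherever p[i] > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon distance: square root of the Jensen-Shannon divergence,
    # the symmetrised KL divergence against the midpoint distribution m
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def jcd(a, b):
    # Jaccard distance between two attribute sets
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(jsd([0.5, 0.5], [0.9, 0.1]))                 # bounded in [0, 1]
print(jcd({"name", "salary"}, {"name", "grade"}))  # → 0.6666666666666667
```

The JSD is used to compare user access distributions, while the JCD compares the sets of attributes accessed by two transactions.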
LIST OF FIGURES
1. Fig 3 (a) Architecture of Learning Phase.
2. Fig 3 (b) Architecture of Testing Phase.
3. Fig 6 (a) Distribution of data in the dataset.
4. Fig 6 (b) Variation of performance measures with number of clusters.
5. Fig 6 (c) Variation of precision, recall, TNR and accuracy with 𝛿1.
6. Fig 6 (d) Variation of precision, recall, TNR and accuracy with φLT.
7. Fig 6 (e) Variation of precision, recall, TNR and accuracy with 𝛿2.
8. Fig 6 (f) Variation of precision, recall, TNR and accuracy with φUT.
9. Fig 6 (g) Variation of modified Jensen-Shannon distance with Euclidean distance.
10. Fig 6 (h) Variation of modified Jensen-Shannon distance with Euclidean distance.
LIST OF TABLES
1. Table 3.1 Types of attributes and their sensitivity levels
2. Table 3.2 Initial Dubiety Table
3. Table 3.3 Updated Dubiety Table
4. Table 3.4 Rule Generator for the Given Example
5. Table 3.5 User Profile for the Given Example
6. Table 3.6 Initial Dubiety Table
7. Table 3.7 Minimum ds Values for Various Users
8. Table 3.8 Calculated Dubiety Scores Table
9. Table 3.9 Summary of transactions of various users
10. Table 6.1 Performance Measures
11. Table 6.2 Comparison of our approaches
12. Table 6.3 Comparison of our approaches with related works