Document clustering for forensic analysis an approach for improving computer inspection

Document Clustering for Forensic
Analysis: An Approach for Improving
Computer Inspection

ABSTRACT:
In computer forensic analysis, hundreds of thousands of files are usually
examined. Much of the data in those files consists of unstructured text,
whose analysis by computer examiners is difficult to be performed.

In particular, algorithms for clustering documents can facilitate the discovery
of new and useful knowledge from the documents under analysis.

The present an approach that applies document clustering algorithms to
forensic analysis of computers seized in police investigations.

Introction To DataMining
•Data mining refers to as extracting knowledge from the data
•It is the computational process of discovering patterns in large data
sets involving methods at the intersection ofartificial intelligence, machine
learning, statistics, and database systems.
•The overall goal of the data mining process is to extract information from
a data set and transform it into an understandable structure for further use.

Clustering-Introduction
What is Clustering?

Clustering can be considered the most important unsupervised learning
problem; so, as every other problem of this kind, it deals with finding a
structure in a collection of unlabeled data.

Forensic Computing-Introduction

•Digital forensics is a branch of forensic science encompassing The
recovery and investigation of material found in digital devices Or often in
relation to computer crime.
•Computer forensics is the application of investigation and analysis
techniques To gather and preserve them.

Text Mining-Definition

•Text mining also refered to as text data mining,roughly equivalent to
text analysis

•Text mining is the analysis of data contained in natural language text
•The application of text mining techniques to solve problems is called
text analysis

EXISTING SYSTEM:

•Clustering algorithms are typically used for exploratory data analysis
•Where there is little or no prior knowledge about the data
•More precisely, it is likely that the new data sample would come from a
different population
•By doing so ,one can avoid the hard task of examing all the
documents(individually) but ,even if so desired, it still could be done

DISADVANTAGES OF EXISTING SYSTEM:
•The literature on Computer Forensics only reports the use of algorithms that
assume that the number of clusters is known and fixed a priori by the user.
•A common approach in other domains involves estimating the number of clusters
from data.

K-means Algorithm Definition

•The K-Means algorithm is an method to cluster objects based on their
attributes into k partitions.
•It assumes that the k clusters exhibit Gaussian distributions.
•It assumes that the object attributes form a vector space.

•The objective it tries to achieve is to minimize total intra-cluster variance.

K-means example
•Problem: Cluster the following eight points (with (x, y) representing
locations)
•into three clusters A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4)
A7(1, 2) A8(4, 9).
•Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

•The distance function between two points a=(x1, y1) and b=(x2, y2)
•is defined as: ρ(a, b) = |x2 – x1| + |y2 – y1| .
•Use k-means algorithm to find the three cluster centers after the second
iteration.

Solution:

A1

Point
(2, 10)

A2

(5, 8)

A5

(7, 5)

A6

(6, 4)

A7

(1, 2)

A8

(1, 2)
Dist Mean 3

(8, 4)

A4

(5, 8)
Dist Mean 2

(2, 5)

A3

(2, 10)
Dist Mean 1

Cluster

(4, 9)

First we list all points in the first column of the table above. The initial
cluster centers – means, are (2, 10), (5, 8) and (1, 2) - chosen randomly.
Next, we will calculate the distance from the first point (2, 10) to each of
the three means, by using the distance function:

Solution:(Cont..)
point
mean1
x1, y1
x2, y2
(2, 10)
(2, 10)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0+0
=0
point
mean2
x1, y1
x2, y2
(2, 10)
(5, 8)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
point
mean3
x1, y1
x2, y2
(2, 10)
(1, 2)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9

Analogically, we fill in the rest of the table, and place each point in one of the
clusters:
(2, 10)

(5, 8)

(1, 2)

Point

Dist Mean 1

Dist Mean 2

Dist Mean 3

Cluster

A1

(2, 10)

0

5

9

1

A2

(2, 5)

5

6

4

3

A3

(8, 4)

12

7

9

2

A4

(5, 8)

5

0

10

2

A5

(7, 5)

10

5

9

2

A6

(6, 4)

10

5

7

2

A7

(1, 2)

9

10

0

3

A8

(4, 9)

3

2

10

2

PROPOSED SYSTEM:
•I present an approach that applies document clustering algorithms to forensic
analysis of computers seized in police investigations.
•Clustering algorithms have been studied for decades, and the literature on the
subject is huge.
•We decided to choose a set of (six) representative algorithm in order to show
the potential of the proposed approach, namely: the partitional K-means and
K-medoids, the hierarchical Single/Complete/Average Link, and the cluster
ensemble algorithm known as CSPA.
•In order to make the comparative analysis of the algorithms more realistic, two
relative validity indexes have been used to estimate the number of clusters
automatically ZCfrom data.

ADVANTAGES OF PROPOSED SYSTEM:

•Evaluation of the proposed approach in applications show that it has the
potential to speed up the computer inspection process.

ESTIMATING THE NUMBER OF CLUSTERS FROM DATA

A widely used relative validity index is the so-called silhouette
Let us consider an object belonging to cluster . The average dissimilarity of to
all other objects of A is denoted by a(i) . Now let us take into account cluster C.
The average dissimilarity of i to all objects of C will be called d(i,C) . After
computing d(i,C) for all clusters C!=A, the smallest one is selected,
i.e.,b(i)=min d(i,C) ,C!=A . This value represents the dissimilarity of to its
neighbor cluster, and the silhouette for a give object, s(i), is given by:

Thus, the higher the better the assignment of object to a given cluster

CONCLUSION
•With the proposed concept we can estimate the number of clusters
automatically to achive get good results.

Document clustering for forensic analysis an approach for improving computer inspection

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Document clustering for forensic analysis an approach for improving computer inspection

Similar to Document clustering for forensic analysis an approach for improving computer inspection (20)

More from Madan Golla

More from Madan Golla (9)

Recently uploaded

Recently uploaded (20)

Document clustering for forensic analysis an approach for improving computer inspection