SlideShare a Scribd company logo
1 of 19
Document Clustering for Forensic
Analysis: An Approach for Improving
Computer Inspection
ABSTRACT:
In computer forensic analysis, hundreds of thousands of files are usually
examined. Much of the data in those files consists of unstructured text,
whose analysis by computer examiners is difficult to be performed.

In particular, algorithms for clustering documents can facilitate the discovery
of new and useful knowledge from the documents under analysis.

The present an approach that applies document clustering algorithms to
forensic analysis of computers seized in police investigations.
Introction To DataMining
•Data mining refers to as extracting knowledge from the data
•It is the computational process of discovering patterns in large data
sets involving methods at the intersection ofartificial intelligence, machine
learning, statistics, and database systems.
•The overall goal of the data mining process is to extract information from
a data set and transform it into an understandable structure for further use.
Clustering-Introduction
What is Clustering?

Clustering can be considered the most important unsupervised learning
problem; so, as every other problem of this kind, it deals with finding a
structure in a collection of unlabeled data.
Forensic Computing-Introduction

•Digital forensics is a branch of forensic science encompassing The
recovery and investigation of material found in digital devices Or often in
relation to computer crime.
•Computer forensics is the application of investigation and analysis
techniques To gather and preserve them.
Text Mining-Definition

•Text mining also refered to as text data mining,roughly equivalent to
text analysis

•Text mining is the analysis of data contained in natural language text
•The application of text mining techniques to solve problems is called
text analysis
EXISTING SYSTEM:

•Clustering algorithms are typically used for exploratory data analysis
•Where there is little or no prior knowledge about the data
•More precisely, it is likely that the new data sample would come from a
different population
•By doing so ,one can avoid the hard task of examing all the
documents(individually) but ,even if so desired, it still could be done
DISADVANTAGES OF EXISTING SYSTEM:
•The literature on Computer Forensics only reports the use of algorithms that
assume that the number of clusters is known and fixed a priori by the user.
•A common approach in other domains involves estimating the number of clusters
from data.
K-means Algorithm Definition

•The K-Means algorithm is an method to cluster objects based on their
attributes into k partitions.
•It assumes that the k clusters exhibit Gaussian distributions.
•It assumes that the object attributes form a vector space.

•The objective it tries to achieve is to minimize total intra-cluster variance.
K-means example
•Problem: Cluster the following eight points (with (x, y) representing
locations)
•into three clusters A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4)
A7(1, 2) A8(4, 9).
•Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

•The distance function between two points a=(x1, y1) and b=(x2, y2)
•is defined as: ρ(a, b) = |x2 – x1| + |y2 – y1| .
•Use k-means algorithm to find the three cluster centers after the second
iteration.
Solution:

A1

Point
(2, 10)

A2

(5, 8)

A5

(7, 5)

A6

(6, 4)

A7

(1, 2)

A8

(1, 2)
Dist Mean 3

(8, 4)

A4

(5, 8)
Dist Mean 2

(2, 5)

A3

(2, 10)
Dist Mean 1

Cluster

(4, 9)

First we list all points in the first column of the table above. The initial
cluster centers – means, are (2, 10), (5, 8) and (1, 2) - chosen randomly.
Next, we will calculate the distance from the first point (2, 10) to each of
the three means, by using the distance function:
Solution:(Cont..)
point
mean1
x1, y1
x2, y2
(2, 10)
(2, 10)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0+0
=0
point
mean2
x1, y1
x2, y2
(2, 10)
(5, 8)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
point
mean3
x1, y1
x2, y2
(2, 10)
(1, 2)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
Analogically, we fill in the rest of the table, and place each point in one of the
clusters:
(2, 10)

(5, 8)

(1, 2)

Point

Dist Mean 1

Dist Mean 2

Dist Mean 3

Cluster

A1

(2, 10)

0

5

9

1

A2

(2, 5)

5

6

4

3

A3

(8, 4)

12

7

9

2

A4

(5, 8)

5

0

10

2

A5

(7, 5)

10

5

9

2

A6

(6, 4)

10

5

7

2

A7

(1, 2)

9

10

0

3

A8

(4, 9)

3

2

10

2
PROPOSED SYSTEM:
•I present an approach that applies document clustering algorithms to forensic
analysis of computers seized in police investigations.
•Clustering algorithms have been studied for decades, and the literature on the
subject is huge.
•We decided to choose a set of (six) representative algorithm in order to show
the potential of the proposed approach, namely: the partitional K-means and
K-medoids, the hierarchical Single/Complete/Average Link, and the cluster
ensemble algorithm known as CSPA.
•In order to make the comparative analysis of the algorithms more realistic, two
relative validity indexes have been used to estimate the number of clusters
automatically ZCfrom data.
ADVANTAGES OF PROPOSED SYSTEM:

•Evaluation of the proposed approach in applications show that it has the
potential to speed up the computer inspection process.
ESTIMATING THE NUMBER OF CLUSTERS FROM DATA

A widely used relative validity index is the so-called silhouette
Let us consider an object belonging to cluster . The average dissimilarity of to
all other objects of A is denoted by a(i) . Now let us take into account cluster C.
The average dissimilarity of i to all objects of C will be called d(i,C) . After
computing d(i,C) for all clusters C!=A, the smallest one is selected,
i.e.,b(i)=min d(i,C) ,C!=A . This value represents the dissimilarity of to its
neighbor cluster, and the silhouette for a give object, s(i), is given by:

Thus, the higher the better the assignment of object to a given cluster
CONCLUSION
•With the proposed concept we can estimate the number of clusters
automatically to achive get good results.
THANKYOU
?

More Related Content

What's hot

3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Modelsguest0edcaf
 
Query evaluation and optimization
Query evaluation and optimizationQuery evaluation and optimization
Query evaluation and optimizationlavanya marichamy
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering enginesYash Darak
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clusteringCosmoAIMS Bassett
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
web clustering engines
web clustering enginesweb clustering engines
web clustering enginesArun TR
 
5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological dataKrish_ver2
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysisPushkar Mishra
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVijay Koushik
 
Unsupervised Learning
Unsupervised LearningUnsupervised Learning
Unsupervised LearningSAHEEL FAL DESAI
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 
Lect4
Lect4Lect4
Lect4sumit621
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering enginesunyil96
 
Basic of Data Structure - Data Structure - Notes
Basic of Data Structure - Data Structure - NotesBasic of Data Structure - Data Structure - Notes
Basic of Data Structure - Data Structure - NotesOmprakash Chauhan
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspectiveankurpandeyinfo
 

What's hot (20)

3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
Query evaluation and optimization
Query evaluation and optimizationQuery evaluation and optimization
Query evaluation and optimization
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
web clustering engines
web clustering enginesweb clustering engines
web clustering engines
 
5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
 
Unsupervised Learning
Unsupervised LearningUnsupervised Learning
Unsupervised Learning
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Lect4
Lect4Lect4
Lect4
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
 
Basic of Data Structure - Data Structure - Notes
Basic of Data Structure - Data Structure - NotesBasic of Data Structure - Data Structure - Notes
Basic of Data Structure - Data Structure - Notes
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspective
 

Viewers also liked

Computer forensic
Computer forensicComputer forensic
Computer forensicbhavithd
 
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic Examiners
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic ExaminersBoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic Examiners
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic ExaminersBoyarMiller
 
Computer Forensics in Fighting Crimes
Computer Forensics in Fighting CrimesComputer Forensics in Fighting Crimes
Computer Forensics in Fighting CrimesIsaiah Edem
 
Business Intelligence (BI) Tools For Computer Forensic
Business Intelligence (BI) Tools For Computer ForensicBusiness Intelligence (BI) Tools For Computer Forensic
Business Intelligence (BI) Tools For Computer ForensicDhiren Gala
 
Introduction to computer forensic
Introduction to computer forensicIntroduction to computer forensic
Introduction to computer forensicOnline
 
Computer forensic 101 - OWASP Khartoum
Computer forensic 101 - OWASP KhartoumComputer forensic 101 - OWASP Khartoum
Computer forensic 101 - OWASP KhartoumOWASP Khartoum
 
Digital Evidence in Computer Forensic Investigations
Digital Evidence in Computer Forensic InvestigationsDigital Evidence in Computer Forensic Investigations
Digital Evidence in Computer Forensic InvestigationsFilip Maertens
 
Personalised Search: Historical & Geographical Factors
Personalised Search: Historical & Geographical FactorsPersonalised Search: Historical & Geographical Factors
Personalised Search: Historical & Geographical FactorsPhill Ohren
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudRob Gillen
 
Personalizing Image Search from the Photo Sharing Websites
Personalizing Image Search from the Photo Sharing WebsitesPersonalizing Image Search from the Photo Sharing Websites
Personalizing Image Search from the Photo Sharing WebsitesAM Publications
 
How to Realize the Benefits of Cloud Services Brokerage
How to Realize the Benefits of Cloud Services BrokerageHow to Realize the Benefits of Cloud Services Brokerage
How to Realize the Benefits of Cloud Services Brokeragejamcracker4677
 
The design of forensic computer workstations
The design of forensic computer workstationsThe design of forensic computer workstations
The design of forensic computer workstationsjkvr100
 
MattockFS Computer Forensic File-System
MattockFS Computer Forensic File-SystemMattockFS Computer Forensic File-System
MattockFS Computer Forensic File-SystemRob Meijer
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionSOYEON KIM
 
Computer forensic ppt
Computer forensic pptComputer forensic ppt
Computer forensic pptOnkar1431
 
Electornic evidence collection
Electornic evidence collectionElectornic evidence collection
Electornic evidence collectionFakrul Alam
 
Capturing forensics image
Capturing forensics imageCapturing forensics image
Capturing forensics imageChris Harrington
 

Viewers also liked (20)

Computer forensic
Computer forensicComputer forensic
Computer forensic
 
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic Examiners
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic ExaminersBoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic Examiners
BoyarMiller - You Lost Me At Gigabyte: Working with Computer Forensic Examiners
 
Computer forensic
Computer forensicComputer forensic
Computer forensic
 
Computer Forensics in Fighting Crimes
Computer Forensics in Fighting CrimesComputer Forensics in Fighting Crimes
Computer Forensics in Fighting Crimes
 
Business Intelligence (BI) Tools For Computer Forensic
Business Intelligence (BI) Tools For Computer ForensicBusiness Intelligence (BI) Tools For Computer Forensic
Business Intelligence (BI) Tools For Computer Forensic
 
Introduction to computer forensic
Introduction to computer forensicIntroduction to computer forensic
Introduction to computer forensic
 
Computer forensic 101 - OWASP Khartoum
Computer forensic 101 - OWASP KhartoumComputer forensic 101 - OWASP Khartoum
Computer forensic 101 - OWASP Khartoum
 
Digital Evidence in Computer Forensic Investigations
Digital Evidence in Computer Forensic InvestigationsDigital Evidence in Computer Forensic Investigations
Digital Evidence in Computer Forensic Investigations
 
Personalised Search: Historical & Geographical Factors
Personalised Search: Historical & Geographical FactorsPersonalised Search: Historical & Geographical Factors
Personalised Search: Historical & Geographical Factors
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
 
Disease Detection System
Disease Detection SystemDisease Detection System
Disease Detection System
 
Personalizing Image Search from the Photo Sharing Websites
Personalizing Image Search from the Photo Sharing WebsitesPersonalizing Image Search from the Photo Sharing Websites
Personalizing Image Search from the Photo Sharing Websites
 
How to Realize the Benefits of Cloud Services Brokerage
How to Realize the Benefits of Cloud Services BrokerageHow to Realize the Benefits of Cloud Services Brokerage
How to Realize the Benefits of Cloud Services Brokerage
 
The design of forensic computer workstations
The design of forensic computer workstationsThe design of forensic computer workstations
The design of forensic computer workstations
 
MattockFS Computer Forensic File-System
MattockFS Computer Forensic File-SystemMattockFS Computer Forensic File-System
MattockFS Computer Forensic File-System
 
IR
IRIR
IR
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS Detection
 
Computer forensic ppt
Computer forensic pptComputer forensic ppt
Computer forensic ppt
 
Electornic evidence collection
Electornic evidence collectionElectornic evidence collection
Electornic evidence collection
 
Capturing forensics image
Capturing forensics imageCapturing forensics image
Capturing forensics image
 

Similar to Document clustering for forensic analysis an approach for improving computer inspection

Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
 
Clustering
ClusteringClustering
ClusteringMeme Hei
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2IAEME Publication
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clusteringijcsity
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1bPRAWEEN KUMAR
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisIOSR Journals
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Dataidescitation
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Data clustering
Data clustering Data clustering
Data clustering GARIMA SHAKYA
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
 

Similar to Document clustering for forensic analysis an approach for improving computer inspection (20)

Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 
Clustering
ClusteringClustering
Clustering
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clustering
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
50120140505013
5012014050501350120140505013
50120140505013
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Data clustering
Data clustering Data clustering
Data clustering
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming data
 

More from Madan Golla

Optimizing cloud resources for delivering iptv services through virtualization
Optimizing cloud resources for delivering iptv services through virtualizationOptimizing cloud resources for delivering iptv services through virtualization
Optimizing cloud resources for delivering iptv services through virtualizationMadan Golla
 
Enabling dynamic data and indirect mutual trust for cloud computing storage s...
Enabling dynamic data and indirect mutual trust for cloud computing storage s...Enabling dynamic data and indirect mutual trust for cloud computing storage s...
Enabling dynamic data and indirect mutual trust for cloud computing storage s...Madan Golla
 
Emap expedite message authentication protocol for vehicular ad hoc networks
Emap expedite message authentication protocol     for vehicular ad hoc networksEmap expedite message authentication protocol     for vehicular ad hoc networks
Emap expedite message authentication protocol for vehicular ad hoc networksMadan Golla
 
Collaboration in multi cloud computing environments framework and security is...
Collaboration in multi cloud computing environments framework and security is...Collaboration in multi cloud computing environments framework and security is...
Collaboration in multi cloud computing environments framework and security is...Madan Golla
 
Cloud mo v
Cloud mo vCloud mo v
Cloud mo vMadan Golla
 
Cloud computing for mobile users can offloading computation save energy
Cloud computing for mobile users can offloading computation save energyCloud computing for mobile users can offloading computation save energy
Cloud computing for mobile users can offloading computation save energyMadan Golla
 
A system to filter unwanted messages from the
A system to filter unwanted messages from theA system to filter unwanted messages from the
A system to filter unwanted messages from theMadan Golla
 
A generalized flow based method for analysis of implicit
A generalized flow based method for analysis of implicitA generalized flow based method for analysis of implicit
A generalized flow based method for analysis of implicitMadan Golla
 

More from Madan Golla (9)

Rise
RiseRise
Rise
 
Optimizing cloud resources for delivering iptv services through virtualization
Optimizing cloud resources for delivering iptv services through virtualizationOptimizing cloud resources for delivering iptv services through virtualization
Optimizing cloud resources for delivering iptv services through virtualization
 
Enabling dynamic data and indirect mutual trust for cloud computing storage s...
Enabling dynamic data and indirect mutual trust for cloud computing storage s...Enabling dynamic data and indirect mutual trust for cloud computing storage s...
Enabling dynamic data and indirect mutual trust for cloud computing storage s...
 
Emap expedite message authentication protocol for vehicular ad hoc networks
Emap expedite message authentication protocol     for vehicular ad hoc networksEmap expedite message authentication protocol     for vehicular ad hoc networks
Emap expedite message authentication protocol for vehicular ad hoc networks
 
Collaboration in multi cloud computing environments framework and security is...
Collaboration in multi cloud computing environments framework and security is...Collaboration in multi cloud computing environments framework and security is...
Collaboration in multi cloud computing environments framework and security is...
 
Cloud mo v
Cloud mo vCloud mo v
Cloud mo v
 
Cloud computing for mobile users can offloading computation save energy
Cloud computing for mobile users can offloading computation save energyCloud computing for mobile users can offloading computation save energy
Cloud computing for mobile users can offloading computation save energy
 
A system to filter unwanted messages from the
A system to filter unwanted messages from theA system to filter unwanted messages from the
A system to filter unwanted messages from the
 
A generalized flow based method for analysis of implicit
A generalized flow based method for analysis of implicitA generalized flow based method for analysis of implicit
A generalized flow based method for analysis of implicit
 

Recently uploaded

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 

Recently uploaded (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 

Document clustering for forensic analysis an approach for improving computer inspection

  • 1. Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection
  • 2. ABSTRACT: In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. The present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations.
  • 3. Introction To DataMining •Data mining refers to as extracting knowledge from the data •It is the computational process of discovering patterns in large data sets involving methods at the intersection ofartificial intelligence, machine learning, statistics, and database systems. •The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
  • 4. Clustering-Introduction What is Clustering? Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
  • 5. Forensic Computing-Introduction •Digital forensics is a branch of forensic science encompassing The recovery and investigation of material found in digital devices Or often in relation to computer crime. •Computer forensics is the application of investigation and analysis techniques To gather and preserve them.
  • 6. Text Mining-Definition •Text mining also refered to as text data mining,roughly equivalent to text analysis •Text mining is the analysis of data contained in natural language text •The application of text mining techniques to solve problems is called text analysis
  • 7. EXISTING SYSTEM: •Clustering algorithms are typically used for exploratory data analysis •Where there is little or no prior knowledge about the data •More precisely, it is likely that the new data sample would come from a different population •By doing so ,one can avoid the hard task of examing all the documents(individually) but ,even if so desired, it still could be done
  • 8. DISADVANTAGES OF EXISTING SYSTEM: •The literature on Computer Forensics only reports the use of algorithms that assume that the number of clusters is known and fixed a priori by the user. •A common approach in other domains involves estimating the number of clusters from data.
  • 9. K-means Algorithm Definition •The K-Means algorithm is an method to cluster objects based on their attributes into k partitions. •It assumes that the k clusters exhibit Gaussian distributions. •It assumes that the object attributes form a vector space. •The objective it tries to achieve is to minimize total intra-cluster variance.
  • 10. K-means example •Problem: Cluster the following eight points (with (x, y) representing locations) •into three clusters A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). •Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). •The distance function between two points a=(x1, y1) and b=(x2, y2) •is defined as: ρ(a, b) = |x2 – x1| + |y2 – y1| . •Use k-means algorithm to find the three cluster centers after the second iteration.
  • 11. Solution: A1 Point (2, 10) A2 (5, 8) A5 (7, 5) A6 (6, 4) A7 (1, 2) A8 (1, 2) Dist Mean 3 (8, 4) A4 (5, 8) Dist Mean 2 (2, 5) A3 (2, 10) Dist Mean 1 Cluster (4, 9) First we list all points in the first column of the table above. The initial cluster centers – means, are (2, 10), (5, 8) and (1, 2) - chosen randomly. Next, we will calculate the distance from the first point (2, 10) to each of the three means, by using the distance function:
  • 12. Solution:(Cont..) point mean1 x1, y1 x2, y2 (2, 10) (2, 10) ρ(a, b) = |x2 – x1| + |y2 – y1| ρ(point, mean1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| =0+0 =0 point mean2 x1, y1 x2, y2 (2, 10) (5, 8) ρ(a, b) = |x2 – x1| + |y2 – y1| ρ(point, mean2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| =3+2 =5 point mean3 x1, y1 x2, y2 (2, 10) (1, 2) ρ(a, b) = |x2 – x1| + |y2 – y1| ρ(point, mean2) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| =1+8 =9
  • 13. Analogically, we fill in the rest of the table, and place each point in one of the clusters: (2, 10) (5, 8) (1, 2) Point Dist Mean 1 Dist Mean 2 Dist Mean 3 Cluster A1 (2, 10) 0 5 9 1 A2 (2, 5) 5 6 4 3 A3 (8, 4) 12 7 9 2 A4 (5, 8) 5 0 10 2 A5 (7, 5) 10 5 9 2 A6 (6, 4) 10 5 7 2 A7 (1, 2) 9 10 0 3 A8 (4, 9) 3 2 10 2
  • 14. PROPOSED SYSTEM: •I present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. •Clustering algorithms have been studied for decades, and the literature on the subject is huge. •We decided to choose a set of (six) representative algorithm in order to show the potential of the proposed approach, namely: the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster ensemble algorithm known as CSPA. •In order to make the comparative analysis of the algorithms more realistic, two relative validity indexes have been used to estimate the number of clusters automatically ZCfrom data.
  • 15. ADVANTAGES OF PROPOSED SYSTEM: •Evaluation of the proposed approach in applications show that it has the potential to speed up the computer inspection process.
  • 16. ESTIMATING THE NUMBER OF CLUSTERS FROM DATA A widely used relative validity index is the so-called silhouette Let us consider an object belonging to cluster . The average dissimilarity of to all other objects of A is denoted by a(i) . Now let us take into account cluster C. The average dissimilarity of i to all objects of C will be called d(i,C) . After computing d(i,C) for all clusters C!=A, the smallest one is selected, i.e.,b(i)=min d(i,C) ,C!=A . This value represents the dissimilarity of to its neighbor cluster, and the silhouette for a give object, s(i), is given by: Thus, the higher the better the assignment of object to a given cluster
  • 17. CONCLUSION •With the proposed concept we can estimate the number of clusters automatically to achive get good results.
  • 19. ?