International Journal of Computer Science & Information Technology (IJCSIT) Vol 7, No 3, June 2015
DOI:10.5121/ijcsit.2015.7306
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
Deepa Patil and Yashwant Dongre
Department of Computer Engineering, VIIT, Pune, India
ABSTRACT
In today's internet-driven world, e-documents such as HTML pages and digital libraries occupy a
considerable amount of cyberspace, and organizing them has become a practical need.
Clustering is an important technique that organizes a large number of objects into smaller coherent groups,
which helps these documents to be used efficiently and effectively for information retrieval and other NLP tasks.
Email is one of the e-documents most frequently used by individuals and organizations. Email categorization is
one of the major tasks of email mining. Categorizing emails into different groups makes retrieval and
maintenance easier. Like other e-documents, emails can be grouped using clustering algorithms. In this
paper a similarity measure called Similarity Measure for Text Processing (SMTP) is suggested for email clustering.
The measure takes into account three situations: a feature appears in both emails,
a feature appears in only one email, and a feature appears in neither email. The effectiveness of the
measure is analyzed on the Enron email data set. The results indicate that, on this data set, the
accuracy obtained with SMTP is better than that obtained with the other measures tested.
KEYWORDS
Similarity Measure, Clustering Algorithm, Document Clustering, Email Mining
1. INTRODUCTION
We live in a cyber world flooded with information. Efficient and accurate
retrieval of information is important for the survival of many web portals. Text processing is an
important aspect of information retrieval, data mining and web search. It is equally important to
cluster emails into different groups so that similar emails, i.e. emails containing a given
search string or keywords, can be retrieved. The emails can be further grouped by time stamp for easier
retrieval.
As the number of digital documents has increased dramatically over the years with the
growth of the internet, information management, search and retrieval have become important
concerns. Similarity measures play an important role in text classification and clustering algorithms.
Often a bag-of-words model [1], [2], [3] is used in information retrieval or text processing
tasks, wherein a document is modelled as the collection of unique words it contains together with their
frequencies. Word order is ignored completely. The importance of a word in a document can be
decided based on term frequency (the number of times the word appears in the document),
relative term frequency (the ratio between the term frequency and the total number of occurrences of all
terms in the document set) or tf-idf (a combination of term frequency and inverse document
frequency) [4]. The document collection is converted into a matrix in which each word adds a new
dimension as a row and each document is represented as a column vector, so that
each entry gives the frequency of a word in a particular
document. Because any single document uses only a small fraction of the vocabulary, the matrix is sparse.
The higher the frequency of a word, the more descriptive it is of the document.
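The term-document matrix described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the sample documents, the whitespace tokenizer and the particular tf-idf variant (raw count multiplied by log(N / document frequency)) are all assumptions made for the example.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per word, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Entry [i][j] is the raw frequency of word i in document j.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

def tf_idf(matrix):
    """Weight each raw count by log(N / df), one common tf-idf variant."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for x in row if x > 0)  # number of documents containing the word
        idf = math.log(n_docs / df)
        weighted.append([x * idf for x in row])
    return weighted

# Hypothetical sample documents, for illustration only.
docs = ["meeting schedule for sales meeting",
        "sales report attached",
        "lunch schedule"]
vocab, tf = term_document_matrix(docs)
weights = tf_idf(tf)
```

Most entries of `tf` are zero even for these three short documents, which is the sparsity noted above.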
Clustering is a technique that organizes a large number of objects into smaller coherent groups.
It aims to group similar documents together and to separate each group as much as
possible from groups that contain information on entirely different topics. Clustering
algorithms [4] require a metric that quantifies how similar or different two given documents are.
This difference is often expressed as a distance, and such measures are called
similarity measures. The measure assigns a numeric
value to the extent of difference between two documents, which the clustering algorithm uses
to partition a given dataset into groups.
Many similarity measures exist for computing the similarity between two documents.
Euclidean distance [5] is a popular metric taken from Euclidean geometry.
Cosine similarity [4] is the cosine of the angle between two vectors. The
Jaccard coefficient [6] is a statistic for comparing the similarity of two sets and is
defined as the size of their intersection divided by the size of their union. IT_Sim [7], [8], an
information-theoretic measure for document similarity, is a phrase-based measure that
computes similarity based on the Suffix Tree Document Model. Pairwise-adaptive similarity [9]
dynamically selects a number of features out of document d1 and document
d2. In [7], [10] Hamming distance is used; the Hamming distance between two document vectors is the
number of positions at which the corresponding symbols differ. The Kullback-Leibler divergence
[11] is a non-symmetric measure of the difference between the probability distributions associated with
two vectors.
It is often necessary to search email contents to retrieve emails containing similar content
or keywords. It is equally necessary to search the email body in order to place emails into different
groups or to bring up the emails that contain a search string or search keywords.
This feature is very useful in the business domain. For example, if an employee in an
organization wants to retrieve emails containing information about recent sales in a
particular area, he or she can specify the search string and search options accordingly and fetch the
emails containing the required information.
Our work focuses on implementing the k-means clustering algorithm [12], [13] with the Similarity
Measure for Text Processing (SMTP) [14] on an email data set to categorize
emails into different groups. The main purpose of this work is to test the effectiveness of SMTP used
with the k-means clustering algorithm for email clustering.
SMTP has several advantages. It treats the presence or absence of a feature as more important than the
difference between the two values associated with a present feature. The degree of similarity
increases as the difference between two non-zero values of a specific feature decreases, and
decreases as the number of presence-absence
features increases. Two documents
are least similar to each other if no feature has a non-zero value in both documents.
SMTP is a symmetric similarity measure. Finally, and most importantly, it takes the
standard deviation of each feature into account when computing that feature's contribution to the
similarity between two documents.
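The properties listed above can be made concrete with a small sketch. The formula in the code is our reading of the measure defined in [14] and should be checked against that paper before being relied upon; the function name, the default value of `lam`, and the handling of the case where both vectors are entirely zero are assumptions made only for illustration.

```python
import math

def smtp_similarity(d1, d2, sigma, lam=1.0):
    """Sketch of SMTP between two term-frequency vectors (formula as we read it in [14]).

    sigma[j] is the standard deviation of feature j over the data set and is
    assumed to be positive; lam is the presence-absence penalty (value assumed).
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, s in zip(d1, d2, sigma):
        if x > 0 and y > 0:
            # Feature present in both: contribution grows as |x - y| shrinks.
            numerator += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
            denominator += 1.0
        elif x == 0 and y == 0:
            # Feature absent from both: contributes nothing.
            continue
        else:
            # Feature present in only one document: penalty.
            numerator -= lam
            denominator += 1.0
    if denominator == 0:
        return 0.0  # assumption: two empty documents are treated as dissimilar
    f = numerator / denominator
    return (f + lam) / (1.0 + lam)
```

With this form, two identical documents score 1, and two documents that share no non-zero feature score 0, which matches the "least similar" property stated above.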
The rest of the paper is organized as follows. Related work is discussed briefly in Section 2.
The proposed system is described in Section 3. Experimental results are presented in Section 4.
Finally, the conclusion is given in Section 5.
2. RELATED WORK
Clustering is one of the data mining methods used for email mining; it
groups emails for easier management. Commonly used
clustering algorithms for email grouping are hierarchical clustering [15] and the k-means clustering
algorithm [12], [13]. The closeness of any two emails can be determined by a distance or similarity
measure such as Euclidean distance, cosine similarity, pairwise-adaptive similarity, the Jaccard
coefficient or the Dice coefficient.
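The Euclidean distance measure [5] is defined as the square root of the sum of squared differences
between the respective coordinates of d1 = (p_1, ..., p_n) and d2 = (q_1, ..., q_n), i.e.
$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (1)$$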
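Equation (1) translates directly into code. The following is a minimal sketch; the function name and the length check are our own additions.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences, as in equation (1)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```

Note that this is a distance: smaller values mean more similar documents, whereas for the similarity measures that follow larger values mean more similar.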
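Cosine similarity [4] measures the cosine of the angle between d1 and d2, with p and q denoting their
vectors, as
$$\mathrm{sim}(p, q) = \cos\theta = \frac{p \cdot q}{\|p\|\,\|q\|} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}} \qquad (2)$$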
The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same,
with 0 usually indicating independence (orthogonal vectors) and in-between values indicating
intermediate similarity or dissimilarity.
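A minimal sketch of the cosine measure follows. The function name and the choice to return 0 for a zero-length vector are assumptions. Because term-frequency vectors have no negative entries, the value for two email vectors always falls between 0 and 1; negative values arise only for general real-valued vectors.

```python
import math

def cosine_similarity(p, q):
    """Dot product of p and q divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_p * norm_q)
```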
The formula for the Jaccard coefficient [6] is:
SJ = a / (a + b + c) (3)
where
SJ = Jaccard similarity coefficient
a = number of terms common to (shared by) both documents
b = number of terms unique to the first document
c = number of terms unique to the second document
The Jaccard coefficient uses presence/absence data.
The Dice coefficient is similar to the Jaccard index. It also uses presence/absence data and is
given by:
SS = 2a / (2a + b + c) (4)
where SS is the Dice similarity coefficient and a, b and c are defined as for the Jaccard coefficient.
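Equations (3) and (4) can be checked with a short sketch that works on sets of terms. The function names and the convention of returning 0 for two empty documents are assumptions.

```python
def _overlap_counts(terms1, terms2):
    """Return (a, b, c): shared terms, terms only in the first, terms only in the second."""
    s1, s2 = set(terms1), set(terms2)
    return len(s1 & s2), len(s1 - s2), len(s2 - s1)

def jaccard_coefficient(terms1, terms2):
    a, b, c = _overlap_counts(terms1, terms2)
    total = a + b + c
    return a / total if total else 0.0

def dice_coefficient(terms1, terms2):
    a, b, c = _overlap_counts(terms1, terms2)
    total = 2 * a + b + c
    return 2 * a / total if total else 0.0
```

For example, the term sets {sales, report, june} and {report, june, meeting} give a = 2, b = 1, c = 1, so SJ = 2/4 = 0.5 and SS = 4/6, roughly 0.67. The Dice coefficient is never smaller than the Jaccard coefficient for the same pair.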
IT_Sim [7], [8] is a phrase-based measure that computes similarity based on the Suffix Tree
Document Model. Pairwise-adaptive similarity [9] dynamically selects a number of features out
of d1 and d2. The Hamming distance [7], [10] between two vectors is the number of positions at
which the corresponding symbols differ. The Kullback-Leibler divergence [11] is a non-symmetric
measure of the difference between the probability distributions associated with two vectors.
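Two of these measures are simple enough to sketch directly. The code below is illustrative only; the function names are ours, and the Kullback-Leibler function assumes that both inputs are probability distributions and that q is non-zero wherever p is non-zero.

```python
import math

def hamming_distance(u, v):
    """Number of positions at which the two vectors hold different symbols."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)

def kl_divergence(p, q):
    """Sum of p_i * log(p_i / q_i); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Swapping the arguments of `kl_divergence` generally changes the result, which is what "non-symmetric" means here.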
Studies have shown that the above-mentioned similarity measures do not give optimal results for
text classification. Therefore the Similarity Measure for Text Processing (SMTP)
[14] is used for email categorization along with the k-means clustering algorithm.
3. PROPOSED SYSTEM
Figure 1. Proposed System
Email clustering is carried out on an email data set as follows.
Step 1: Email data is pre-processed. The pre-processing steps are:
1. Removal of stop words
2. Stemming
Step 2: Keywords are identified.
Step 3: Term frequency is calculated.
Step 4: Similarity is calculated using a distance measure or similarity function.
Step 5: Documents are clustered using the k-means clustering algorithm.
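Steps 1 to 3 can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the stop-word list is a small hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "for", "of", "a", "an", "and", "to", "in", "is", "are"}

# Suffixes tried in order; a real stemmer applies far more careful rules.
SUFFIXES = ("ion", "ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Step 1 (stop-word removal, stemming) followed by step 3 (term frequency)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

tf = term_frequencies("The connection is connecting the connected servers")
```

Here `tf` counts the stem "connect" three times and "server" once; step 2 (keyword identification) could then keep only the highest-frequency stems.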
An email consists of structured information, such as the email header, and unstructured information,
such as the subject and body. Text processing is done on the unstructured information
in the message. Pre-processing of the raw data is the first step in any email management
task. In this step the email header, body and attachments are parsed. From the parsed data, the
subject, the email content and other fields are extracted. Once the required text is obtained (in our
case the email body), stop words such as ‘the’, ‘for’ and ‘of’
are removed from it.
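The parsing step can be illustrated with Python's standard `email` package. This is only a sketch of the idea; the sample message and the function name are hypothetical, and the implementation used in this work may parse messages differently.

```python
from email import message_from_string

# Hypothetical raw message in the usual header / blank line / body layout.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Quarterly sales figures

The sales report for the northern region is attached.
"""

def extract_subject_and_body(raw):
    """Separate the structured header fields from the unstructured body text."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    if msg.is_multipart():
        # Keep only the plain-text parts; attachments are skipped.
        parts = [part.get_payload() for part in msg.walk()
                 if part.get_content_type() == "text/plain"]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return subject, body.strip()

subject, body = extract_subject_and_body(RAW)
```

The returned body is the text on which stop-word removal and the later steps operate.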
Next, a stemming algorithm can be used to stem the data. For example, the words ‘connection’,
‘connecting’ and ‘connected’ are all converted to ‘connect’. After stop word removal and stemming,
keywords or terms are identified. We can consider those nouns or pronouns that have a
higher frequency of occurrence. Once the data is extracted from the email, it must be represented in
some format. The most prevalent model for representation is the vector space model. In this
model every email message is represented by a single vector, each element of which corresponds to a
token or feature. In this type of data the tokens are usually words or phrases. These tokens fall
mainly into three categories: unigrams, bigrams and co-occurrences. Unigrams are
significant individual words. For example, in the text “Good Morning, my dear friends” the
unigrams are ‘good’, ‘morning’, ‘my’, ‘dear’ and ‘friends’. Bigrams are pairs of adjacent
words. For example, in the text “hello friends, how are you?” the bigrams are ‘hello
friends’, ‘friends how’, ‘how are’ and ‘are you’. In bigrams the word sequence is important: ‘hello
friends’ and ‘friends hello’ are two different units. Co-occurrences are the same as bigrams except
that the word sequence is not important. For example, ‘hello friends’ and ‘friends hello’ are
treated as a single unit because their order does not matter. There is another feature called target
co-occurrence, which is the same as co-occurrence but with one target word inside each pair.
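The three token categories can be reproduced with a few lines of Python. The function names and the simple letters-only tokenizer are assumptions made for this sketch.

```python
import re

def unigrams(text):
    """Individual lower-cased words."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(text):
    """Ordered pairs of adjacent words."""
    words = unigrams(text)
    return list(zip(words, words[1:]))

def cooccurrences(text):
    """Adjacent pairs with order ignored: each pair is stored in sorted order."""
    return [tuple(sorted(pair)) for pair in bigrams(text)]
```

Applied to “hello friends, how are you?”, `bigrams` yields the four ordered pairs listed above, while `cooccurrences` maps both “hello friends” and “friends hello” to the same unit.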
Once the term frequencies are calculated, a document vector is generated for each email
document. Using the document vectors and the similarity measure,
the similarity between email documents is calculated. The most similar documents are then clustered
together by the clustering algorithm.
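The clustering step can be sketched as a k-means loop that takes the similarity function as a parameter, so that any of the measures discussed in Section 2 can be plugged in. This is an illustrative sketch under stated assumptions, not the implementation used in this work: seeding with the first k vectors, using the arithmetic mean as the centroid, and the sample vectors are all choices made for the example, and cosine similarity is used only because it is short to write.

```python
import math

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms if norms else 0.0

def k_means(vectors, k, similarity, iterations=20):
    """Assign each vector to its most similar centroid until assignments stop changing."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k vectors as seeds
    assignment = [None] * len(vectors)
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: similarity(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical term-frequency vectors for four emails over three terms.
vectors = [[5, 0, 0], [0, 4, 1], [4, 1, 0], [0, 5, 0]]
labels = k_means(vectors, 2, cosine)
```

In this toy example the first and third emails end up in one cluster and the second and fourth in the other. The result of k-means depends on the seeds, so a practical implementation would typically try several initializations.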
4. EXPERIMENTAL RESULTS
The effectiveness of email clustering using the Similarity Measure for Text Processing (SMTP) [14]
is tested by implementing the k-means clustering algorithm with SMTP. The results are compared
with the k-means clustering algorithm implemented using other similarity measures: Euclidean
distance, cosine similarity, the extended Jaccard coefficient and the Dice coefficient. We observed
that SMTP with k-means gives the highest accuracy in three of the four clusters (Table 1) and is
second only to Euclidean distance in the remaining one. The implementation was run on an
Intel(R) Core(TM) i3 1.70 GHz processor with 4 GB RAM and is written entirely in the
object-oriented programming language Java.
4.1 Data set
For the experiments we used the Enron email dataset, a widely used email dataset
that is freely available on the World Wide Web and can be downloaded from [16]. This dataset is
cleaned and contains a large set of email messages organized in folders, with thousands of
messages belonging to almost 150 users. Each user has one folder named after that
user. Each such folder contains subfolders such as inbox, sent, sent-items, drafts and other
user-created subfolders.
The experimental results are based on the emails contained in the inbox folder.
Initially, clusters are formed manually on the basis of similarity. For the experiments,
four clusters are created and similar emails are placed into them.
Clusters generated by our system using the k-means clustering algorithm are saved in a folder
dynamically. These system-generated clusters are then compared with the manually created clusters.
A cluster returned by the system may not contain all the relevant emails, and it may
contain some irrelevant emails. For example, out of 40 emails in a particular cluster returned by the
system, only 30 might be relevant, meaning that those 30 emails are genuinely similar
to one another in content. Finally, the relevant emails in the system-generated
clusters are identified and accuracy is calculated as follows.
Suppose the manually generated cluster contains emails {11, 12, 13, 14} and the
system-generated cluster contains emails {11, 12, 15}.
Two emails match, so matchCounter is 2.
Accuracy = (matchCounter / (length of manually created cluster + length of system-generated
cluster - matchCounter)) * 100
Substituting the values into the above formula gives
(2 / (4 + 3 - 2)) * 100 = 40%
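The accuracy formula can be written as a short function. The function name and the convention for two empty clusters are our assumptions; the variable `match_counter` mirrors matchCounter in the text. The denominator is the size of the union of the two clusters, so this accuracy is the Jaccard index of the two email sets expressed as a percentage.

```python
def cluster_accuracy(manual_cluster, system_cluster):
    """Percentage overlap between a manually created and a system-generated cluster."""
    manual, system = set(manual_cluster), set(system_cluster)
    match_counter = len(manual & system)
    union_size = len(manual) + len(system) - match_counter
    if union_size == 0:
        return 0.0  # assumption: two empty clusters score zero
    return 100.0 * match_counter / union_size

accuracy = cluster_accuracy({11, 12, 13, 14}, {11, 12, 15})  # the worked example above
```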
4.2 Figures and Tables
The following table shows the accuracy obtained for the four clusters with each of the five
similarity measures. Overall, SMTP retrieves the largest share of relevant emails.
Table 1. Accuracy (%) obtained for different clusters for five similarity measures

Similarity Measure              Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cosine Similarity               36.1111     6.4516      20.8333     28.5714
Euclidean Distance              24.6575     25.00       2.38095     20.00
Dice Coefficient                23.3333     8.8235      23.9130     3.8461
Extended Jaccard Coefficient    24.2424     20.00       27.4509     26.3157
SMTP                            53.8461     23.0769     83.3333     31.5789
The following graphs show the accuracy, in percent, obtained for the different clusters with the five
similarity measures.
Figure 2. Accuracy for cluster 1
Figure 3. Accuracy for cluster 2
Figure 4. Accuracy for cluster 3
Figure 5. Accuracy for cluster 4
The following screenshot shows the email clusters obtained using SMTP.
Figure 6. Output for k- means Clustering Algorithm with SMTP
5. CONCLUSION
The effectiveness of email clustering was tested using the k-means clustering algorithm with the
similarity measure SMTP. In our experiments SMTP gave the highest accuracy for document (email)
clustering in three of the four clusters and was second only to Euclidean distance in the remaining
one. This is mainly because SMTP treats the presence or absence of a feature as more important than
the difference between the two values associated with a present feature. We therefore conclude that,
among the measures tested, the k-means clustering
algorithm implemented with SMTP gives the best results for email (document)
clustering.
As a concluding note, it would be interesting to test how SMTP works with
other clustering algorithms such as fuzzy c-means. It would also be worth
analysing SMTP with classification algorithms such as KNN.
ACKNOWLEDGEMENT
The authors wish to thank everyone who has contributed to this work directly or indirectly.
REFERENCES
[1] T. Joachims, (1997) “A probabilistic analysis of the Rocchio algorithm with TFIDF for text
categorization”, in Proc. 14th Int. Conf. Mach. Learn., San Francisco, CA, USA, pp. 143-151.
[2] H. Kim, P. Howland & H. Park, (2005) “Dimension reduction in text classification with support
vector machines”, J. Mach. Learn. Res., Vol. 6, pp. 37-53.
[3] G. Salton & M. J. McGill, (1983) Introduction to Modern Information Retrieval. London, U.K.:
McGraw-Hill.
[4] J. Han & M. Kamber, (2006) Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA,
USA: Elsevier.
[5] T. W. Schoenharl & G. Madey, (2008) “Evaluation of measurement techniques for the validation of
agent-based simulations against streaming data”, in Proc. ICCS, Krakow, Poland.
[6] C. G. Gonzalez, W. Bonventi, Jr. & A. L. V. Rodrigues, (2008) “Density of closed balls in
real-valued and autometrized Boolean spaces for clustering applications”, in Proc. 19th Brazilian
Symp. Artif. Intell., Salvador, Brazil, pp. 8-22.
[7] J. A. Aslam & M. Frost, (2003) “An information-theoretic measure for document similarity”, in
Proc. 26th SIGIR, Toronto, ON, Canada, pp. 449-450.
[8] D. Lin, (1998) “An information theoretic definition of similarity”, in Proc. 15th Int. Conf. Mach.
Learn., San Francisco, CA, USA.
[9] J. D’hondt, J. Vertommen, P. A. Verhaegen, D. Cattrysse & R. J. Duflou, (2010) “Pairwise-adaptive
dissimilarity measure for document clustering”, Inf. Sci., Vol. 180, No. 12, pp. 2341-2358.
[10] R. W. Hamming, (1950) “Error detecting and error correcting codes”, Bell Syst. Tech. J., Vol. 29,
No. 2, pp. 147-160.
[11] S. Kullback & R. A. Leibler, (1951) “On information and sufficiency”, Ann. Math. Statist.,
Vol. 22, No. 1, pp. 79-86.
[12] G. H. Ball & D. J. Hall, (1967) “A clustering technique for summarizing multivariate data”,
Behav. Sci., Vol. 12, No. 2, pp. 153-155.
[13] R. O. Duda, P. E. Hart & D. J. Stork, (2001) Pattern Classification, New York, NY, USA: Wiley.
[14] Yung-Shen Lin, Jung-Yi Jiang & Shie-Jue Lee, (2014) “A Similarity Measure for Text Classification
and Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 7.
[15] M. B. Eisen, P. T. Spellman, P. O. Brown & D. Botstein, (1998) “Cluster analysis and display of
genome-wide expression patterns”, Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14863-14868.
[16] https://www.cs.cmu.edu/~./enron/
Authors
Ms. Deepa B. Patil is a postgraduate student in Computer Engineering at Savitribai
Phule Pune University, Pune, Maharashtra State, India.
Prof. Yashwant V. Dongre is Assistant Professor at Vishwakarma Institute of Information
Technology (VIIT), Savitribai Phule Pune University, Pune, Maharashtra State, India.
His areas of interest include database management, data mining and information retrieval.
He has several journal papers to his credit, published in journals including
AIRCCE IJDMS.

More Related Content

What's hot

Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique ofIJDKP
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
IRJET- Text Document Clustering using K-Means Algorithm
IRJET-  	  Text Document Clustering using K-Means Algorithm IRJET-  	  Text Document Clustering using K-Means Algorithm
IRJET- Text Document Clustering using K-Means Algorithm IRJET Journal
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
 
Semantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with IdiomsSemantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERINGA FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERINGIJDKP
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Correlation Preserving Indexing Based Text Clustering
Correlation Preserving Indexing Based Text ClusteringCorrelation Preserving Indexing Based Text Clustering
Correlation Preserving Indexing Based Text ClusteringIOSR Journals
 

What's hot (17)

L0261075078
L0261075078L0261075078
L0261075078
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique of
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
IRJET- Text Document Clustering using K-Means Algorithm
IRJET-  	  Text Document Clustering using K-Means Algorithm IRJET-  	  Text Document Clustering using K-Means Algorithm
IRJET- Text Document Clustering using K-Means Algorithm
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine Learning
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documents
 
Semantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with IdiomsSemantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with Idioms
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERINGA FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Correlation Preserving Indexing Based Text Clustering
Correlation Preserving Indexing Based Text ClusteringCorrelation Preserving Indexing Based Text Clustering
Correlation Preserving Indexing Based Text Clustering
 

Viewers also liked

A rule based approach towards detecting human temperament
A rule based approach towards detecting human temperamentA rule based approach towards detecting human temperament
A rule based approach towards detecting human temperamentijcsit
 
Top 7 legal assistant interview questions answers
Top 7 legal assistant interview questions answersTop 7 legal assistant interview questions answers
Top 7 legal assistant interview questions answersjob-interview-questions
 
An iterative morphological decomposition algorithm for reduction of skeleton ...
An iterative morphological decomposition algorithm for reduction of skeleton ...An iterative morphological decomposition algorithm for reduction of skeleton ...
An iterative morphological decomposition algorithm for reduction of skeleton ...ijcsit
 
P REDICTION F OR S HORT -T ERM T RAFFIC F LOW B ASED O N O PTIMIZED W...
P REDICTION  F OR  S HORT -T ERM  T RAFFIC  F LOW  B ASED  O N  O PTIMIZED  W...P REDICTION  F OR  S HORT -T ERM  T RAFFIC  F LOW  B ASED  O N  O PTIMIZED  W...
P REDICTION F OR S HORT -T ERM T RAFFIC F LOW B ASED O N O PTIMIZED W...ijcsit
 
One dimensional vector based pattern
One dimensional vector based patternOne dimensional vector based pattern
One dimensional vector based patternijcsit
 
Top 7 marketing coordinator interview questions answers
Top 7 marketing coordinator interview questions answersTop 7 marketing coordinator interview questions answers
Top 7 marketing coordinator interview questions answersjob-interview-questions
 
Top 7 hr generalist interview questions answers
Top 7 hr generalist interview questions answersTop 7 hr generalist interview questions answers
Top 7 hr generalist interview questions answersjob-interview-questions
 
Top 7 maintenance technician interview questions answers
Top 7 maintenance technician interview questions answersTop 7 maintenance technician interview questions answers
Top 7 maintenance technician interview questions answersjob-interview-questions
 
INTEGRATIONS OF ICT IN EDUCATION SECTOR FOR THE ADVANCEMENT OF THE DEVELOPING...



A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING

International Journal of Computer Science & Information Technology (IJCSIT) Vol 7, No 3, June 2015
DOI:10.5121/ijcsit.2015.7306

A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING

Deepa Patil and Yashwant Dongre
Department of Computer Engineering, VIIT, Pune, India

ABSTRACT

In today's internet world, e-documents such as HTML pages and digital libraries occupy a considerable amount of cyberspace, and organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups. This helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the e-documents most frequently used by individuals and organizations. Email categorization is one of the major tasks of email mining. Categorizing emails into different groups helps easy retrieval and maintenance. Like other e-documents, emails can also be classified using clustering algorithms. In this paper a similarity measure called Similarity Measure for Text Processing is suggested for email clustering. The suggested similarity measure takes into account three situations: the feature appears in both emails, the feature appears in only one email, and the feature appears in neither email. The effectiveness of the suggested similarity measure is analyzed on the Enron email data set to categorize emails. The results indicate that the suggested similarity measure performs better than the other measures tested.

KEYWORDS

Similarity Measure, Clustering Algorithm, Document Clustering, Email Mining

1. INTRODUCTION

We live in a cyber world flooded with information. Efficient and accurate retrieval of information is important for the survival of many web portals. Text processing is an important aspect of information retrieval, data mining and web search.
It is equally important to cluster emails into different groups, so that similar emails, i.e. emails containing a search string or keywords, can be retrieved. The mails can be grouped further by time stamp for easier retrieval. As the number of digital documents has increased dramatically over the years with the growth of the internet, information management, search and retrieval have become important concerns. Similarity measures play an important role in text classification and clustering algorithms.

Often a bag-of-words model [1], [2], [3] is used in information retrieval or text processing tasks, wherein a document is modelled as the collection of unique words it contains together with their frequencies. Word order is completely ignored. The importance of a word in a document can be decided based on term frequency (the number of times a particular word appears in the document), relative term frequency (the ratio between the term frequency and the total number of occurrences of all terms in the document set) or tf-idf (a combination of term frequency and inverse document frequency) [4]. All documents are converted into a matrix in which each word adds a new dimension as a row and each document is represented as a column vector. Each entry in the matrix therefore gives the frequency of a word in a particular document. It is easy to see that the matrix is sparse. The higher the frequency of a word, the more descriptive it is of the document.
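The term-document matrix described above can be illustrated with a minimal Python sketch. The function names, the whitespace tokenizer and the sample documents are assumptions made for illustration; the tf-idf weighting shown, tf × log(N / df), is one common variant and is not necessarily the one used by the authors.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per word, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

def tf_idf(matrix):
    """Weight raw counts by log(N / document frequency), one common tf-idf variant."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for tf in row if tf > 0)  # number of documents containing the word
        idf = math.log(n_docs / df)
        weighted.append([tf * idf for tf in row])
    return weighted

# Hypothetical sample documents
docs = ["meeting schedule meeting", "sales report", "sales meeting"]
vocab, tdm = term_document_matrix(docs)
weights = tf_idf(tdm)
```

Here each row of `tdm` corresponds to one word of `vocab` and each column to one document, so most entries are zero even for this tiny example.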
Clustering is a technique that organizes a large number of objects into smaller coherent groups. It aims at grouping similar documents together and separating each group as much as possible from groups that contain information on entirely different topics. Clustering algorithms [4] require a metric to quantify how similar or different two given documents are. This difference is often measured by a distance measure; such measures are called similarity measures. The measure assigns a numeric value to the extent of difference between two documents, and the clustering algorithm uses this value to form groups within a given dataset.

Many similarity measures exist for computing the similarity between two documents. Euclidean distance [5] is a popular metric taken from Euclidean geometry. Cosine similarity [4] is the cosine of the angle between two vectors. The Jaccard coefficient [6] is a statistic for comparing the similarity of two document sets, defined as the size of their intersection divided by the size of their union. IT_Sim [7], [8], an information-theoretic measure for document similarity, is a phrase-based measure that computes similarity based on the Suffix Tree Document Model. Pairwise-adaptive similarity [9] dynamically selects a number of features out of documents d1 and d2. In [7], [10] Hamming distance is used; the Hamming distance between two document vectors is the number of positions at which the corresponding symbols differ. The Kullback-Leibler divergence [11] is a non-symmetric measure of the difference between the probability distributions associated with two vectors.
It is often essential to search email contents to retrieve mails containing similar content or keywords. It is equally essential to search the email body in order to place emails into different groups, or to bring up the emails that contain a search string or search keywords. This feature is very useful in the business domain. For example, if an employee in an organization wants to retrieve emails containing information about recent sales in a particular area, he/she can specify the search string and search options accordingly and fetch the emails containing the required information.

This work focuses on implementing the k-means clustering algorithm [12], [13] together with the Similarity Measure for Text Processing (SMTP) [14] on an email data set to categorize emails into different groups. The main purpose of this work is to test the effectiveness of SMTP used with the k-means clustering algorithm for email clustering.

SMTP has several advantages. It treats the presence or absence of a feature as more important than the difference between the two values associated with a present feature. The similarity degree increases when the difference between two non-zero values of a specific feature decreases, and it decreases when the number of presence-absence features increases. Two documents are least similar to each other if none of the features have non-zero values in both documents. SMTP is a symmetric similarity measure. Finally, and most importantly, it takes the standard deviation of each feature into account when computing that feature's contribution to the similarity between two documents.

The rest of the paper is organized as follows. Related work is discussed briefly in Section 2. The proposed system is described in Section 3. Experimental results are presented in Section 4. Finally, the conclusion is given in Section 5.
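The properties listed above can be made concrete with a small Python sketch. The formula is not restated in this paper, so the per-feature scoring used here (a Gaussian-style term scaled by the feature's standard deviation when both values are non-zero, a penalty `lam` when only one is non-zero, and no contribution when both are zero) is an assumption about the form given in [14] and should be checked against that reference. The function name, the parameter names `sigma` and `lam`, and the return value of 0.0 for two all-zero vectors are illustrative choices.

```python
import math

def smtp_similarity(d1, d2, sigma, lam=1.0):
    """Sketch of an SMTP-style similarity for non-negative feature vectors.

    sigma: per-feature standard deviations (assumed > 0), hypothetical input.
    lam:   penalty for a feature present in only one document (assumed value).
    """
    numerator = 0.0
    considered = 0  # features that are non-zero in at least one document
    for x, y, s in zip(d1, d2, sigma):
        if x > 0 and y > 0:
            # Both present: the closer the values, the larger the contribution.
            numerator += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
            considered += 1
        elif x == 0 and y == 0:
            continue  # absent in both: contributes nothing
        else:
            numerator -= lam  # presence-absence: penalized
            considered += 1
    if considered == 0:
        return 0.0  # illustrative choice for two empty documents
    f = numerator / considered
    return (f + lam) / (1.0 + lam)  # rescale to the range [0, 1]
```

Under these assumptions, two identical non-empty vectors score 1.0 and two vectors with no feature in common score 0.0, which matches the "least similar" property described above.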
2. RELATED WORK

Clustering is one of the data mining methods used for email mining. It groups emails for easier management. Commonly used clustering algorithms for email grouping are hierarchical clustering [15] and the k-means clustering algorithm [12], [13]. The closeness of any two emails can be determined by a distance or similarity measure such as Euclidean distance, cosine similarity, pairwise-adaptive similarity, the Jaccard coefficient or the Dice coefficient.

The Euclidean distance [5] is defined as the square root of the sum of squared differences between the respective coordinates of d1 and d2, i.e.

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{1}$$

where p_i and q_i are the i-th coordinates of d1 and d2 respectively.

Cosine similarity [4] measures the cosine of the angle between d1 and d2 as

$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{2}$$

where A and B are the vectors of d1 and d2. The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity.

The formula for the Jaccard coefficient [6] for data processing is

$$S_J = \frac{a}{a + b + c} \tag{3}$$

where
S_J = Jaccard similarity coefficient
a = number of terms common to (shared by) both documents
b = number of terms unique to the first document
c = number of terms unique to the second document

The Jaccard coefficient uses presence/absence data. The Dice coefficient is similar to Jaccard's index; it also uses presence/absence data and is given as

$$S_S = \frac{2a}{2a + b + c} \tag{4}$$

where
S_S = Dice similarity coefficient
a = number of terms common to (shared by) both documents
b = number of terms unique to the first document
c = number of terms unique to the second document

IT_Sim [7], [8] is a phrase-based measure that computes similarity based on the Suffix Tree Document Model.
Pairwise-adaptive similarity [9] dynamically selects a number of features out of d1 and d2. The Hamming distance [7], [10] between two vectors is the number of positions at which the corresponding symbols are different. The Kullback-Leibler divergence [11] is a non-symmetric measure of the difference between the probability distributions associated with two vectors.
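The four measures defined in equations (1) to (4) can be written down directly. The following Python sketch is for illustration only; the function names are assumptions, vectors are plain lists of term weights, and the Jaccard and Dice functions take sets of terms because both coefficients use presence/absence data. Returning 0.0 for empty inputs is an illustrative choice.

```python
import math

def euclidean_distance(p, q):
    """Equation (1): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(p, q):
    """Equation (2): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def jaccard_coefficient(terms1, terms2):
    """Equation (3): a / (a + b + c) on sets of terms."""
    a = len(terms1 & terms2)   # terms shared by both documents
    b = len(terms1 - terms2)   # terms unique to the first document
    c = len(terms2 - terms1)   # terms unique to the second document
    return a / (a + b + c) if (a + b + c) else 0.0

def dice_coefficient(terms1, terms2):
    """Equation (4): 2a / (2a + b + c) on sets of terms."""
    a = len(terms1 & terms2)
    b = len(terms1 - terms2)
    c = len(terms2 - terms1)
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0
```

For two documents with term sets {a, b, c} and {b, c, d}, the Jaccard coefficient is 2/4 = 0.5 and the Dice coefficient is 4/6, which shows how Dice gives more weight to shared terms.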
Studies have shown that the above-mentioned similarity measures do not give optimal results for text classification. So a measure known as Similarity Measure for Text Processing (SMTP) [14] is used for email categorization along with the k-means clustering algorithm.

3. PROPOSED SYSTEM

Figure 1. Proposed System

Email clustering is carried out on an email data set in the following steps.

Step 1: Email data is pre-processed. The pre-processing steps are:
   1. Removal of stop words
   2. Stemming
Step 2: Keywords are identified.
Step 3: Term frequency is calculated.
Step 4: Similarity is calculated using a distance measure or similarity function.
Step 5: Documents are clustered using the k-means clustering algorithm.

An email consists of structured information, such as the email header, and unstructured information, such as the subject and body of the email. Text processing is done on the unstructured information available in the message.

Pre-processing of raw data is the first step of any email management task. In this step the email header, email body and attachments are parsed. From the parsed data, the subject and email content or other fields are extracted. Once the required text is obtained, in our case the email content or body, stop words such as 'the', 'for' and 'of' are removed from it.
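Steps 1 to 3 of the list above can be sketched in a few lines of Python. This is a toy illustration: the stop word list is a small hypothetical sample, and `toy_stem` is a naive suffix stripper standing in for a real stemming algorithm, which the paper does not name.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list
STOPWORDS = {"the", "for", "of", "a", "an", "and", "is", "to", "in"}
# Suffixes removed by the toy stemmer, checked in this order
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Step 1: lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

def term_frequencies(text):
    """Steps 2 and 3: count how often each remaining term occurs."""
    return Counter(preprocess(text))
```

For the body text "The connection of the connected servers" this yields the terms 'connect' (twice) and 'server' (once). A production system would replace `toy_stem` with an established stemmer and use a fuller stop word list.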
Next, a stemming algorithm can be used to stem the data. For example, the words 'connection', 'connecting' and 'connected' are converted to 'connect'. After stop word removal and stemming, keywords or terms are identified. We can consider those nouns or pronouns which have a higher frequency of occurrence.

Once the data is extracted from the email, it is represented in some format. The most prevalent representation is the vector space model. In this model every email message is represented by a single vector, and each element of the vector corresponds to a token or feature. The tokens are usually words or phrases. They fall mainly into three categories: unigram, bigram and co-occurrence.

Unigrams are significant individual words. For example, in "Good Morning, my dear friends" the unigrams are 'good', 'morning', 'my', 'dear' and 'friends'. Bigrams are pairs of adjacent words. For example, in "hello friends, how are you?" the bigrams are 'hello friends', 'friends how', 'how are' and 'are you'. In bigrams the word sequence is important: 'hello friends' and 'friends hello' are two different units. Co-occurrences are the same as bigrams except that the word sequence is not important. For example, 'hello friends' and 'friends hello' are treated as a single unit because their order does not matter. There is another feature called target co-occurrence, which is the same as co-occurrence with one target word inside each pair.

Once the term frequency is calculated, a document vector is generated for each email document. Using the document vectors and the similarity measure, the similarity between email documents is calculated. The most similar documents are clustered together by the clustering algorithm.

4. EXPERIMENTAL RESULTS

The effectiveness of email clustering using the Similarity Measure for Text Processing (SMTP) [14] is tested by implementing the k-means clustering algorithm with SMTP. The results are compared with the k-means clustering algorithm implemented using other similarity measures: Euclidean distance, cosine similarity, extended Jaccard coefficient and Dice coefficient. We observed that the results obtained by using SMTP with the k-means clustering algorithm are better than those obtained with the other similarity measures.

For the implementation we used an Intel(R) Core(TM) i3 processor at 1.70 GHz with 4 GB RAM. The entire implementation is done in the object-oriented programming language Java.

4.1 Data set

For experimental purposes we used the Enron email dataset, a widely used email dataset that is freely available on the World Wide Web and can be downloaded from [16]. This dataset is cleaned and contains a large set of email messages organized in folders, with thousands of messages belonging to almost 150 users. For each user, one folder is allocated under the name of that user. Each such folder contains subfolders such as inbox, sent, sent-items, drafts and other user-created subfolders. The experimental results are based on the emails contained in the inbox folder.

At the initial stage, clusters are formed manually on the basis of similarity. For experimental purposes, four clusters are created and similar emails are put into these four clusters. The clusters generated by our system using the k-means clustering algorithm are saved in a folder dynamically. These system-generated clusters are then compared with the manually created clusters.

The clusters returned by the system may not contain all the relevant mails, and they may contain some irrelevant mails. For example, out of 40 mails in a particular cluster returned by the system, only 30 mails might be relevant, meaning that those 30 mails genuinely have similarity
among themselves based on the mail content. Finally, the relevant mails in the system-generated clusters are identified and the accuracy is calculated as follows.

Suppose the manually generated cluster contains emails {11, 12, 13, 14} and the system-generated cluster contains emails {11, 12, 15}. Two emails (11 and 12) appear in both clusters, so matchCounter is 2.

Accuracy = (matchCounter / (length of manually created cluster + length of system generated cluster - matchCounter)) * 100

Substituting the values into the above formula, the accuracy is:

(2 / (4 + 3 - 2)) * 100 = 40%

4.2 Figures and Tables

The following table shows the accuracy obtained for the different clusters with the five similarity measures. Overall, SMTP retrieves the most relevant emails: it achieves the highest accuracy for Clusters 1, 3 and 4, while for Cluster 2 Euclidean distance scores slightly higher (25.00 against 23.0769).

Table 1. Accuracy (%) obtained for different clusters for five similarity measures

Similarity Measure             Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cosine Similarity              36.1111     6.4516      20.8333     28.5714
Euclidean Distance             24.6575     25.00       2.38095     20.00
Dice Coefficient               23.3333     8.8235      23.9130     3.8461
Extended Jaccard Coefficient   24.2424     20.00       27.4509     26.3157
SMTP                           53.8461     23.0769     83.3333     31.5789

The following graphs show the accuracy in percentage obtained for the different clusters with the five similarity measures.

Figure 2. Accuracy for cluster 1
Figure 3. Accuracy for cluster 2
Figure 4. Accuracy for cluster 3
Figure 5. Accuracy for cluster 4

The following screenshot shows the email clusters obtained using SMTP.

Figure 6. Output for k-means Clustering Algorithm with SMTP
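
The accuracy measure defined in Section 4.1 divides the number of emails common to the manual and system-generated clusters by the size of their union, which is the Jaccard index of the two clusters expressed as a percentage. The following is a minimal Java sketch of that calculation, assuming emails are identified by integer IDs as in the worked example. The class and method names (ClusterAccuracy, accuracy) are illustrative and are not taken from the authors' implementation, and returning 0 for two empty clusters is an assumption, since the paper does not cover that case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ClusterAccuracy {

    // Accuracy (%) = matchCounter / (|manual| + |system| - matchCounter) * 100,
    // where matchCounter is the number of emails present in both clusters.
    public static double accuracy(Set<Integer> manual, Set<Integer> system) {
        // Copy the manual cluster so the caller's set is not modified.
        Set<Integer> common = new HashSet<>(manual);
        common.retainAll(system);
        int matchCounter = common.size();

        int unionSize = manual.size() + system.size() - matchCounter;
        if (unionSize == 0) {
            return 0.0; // assumption: two empty clusters score 0
        }
        return 100.0 * matchCounter / unionSize;
    }

    public static void main(String[] args) {
        // Worked example from Section 4.1.
        Set<Integer> manual = new HashSet<>(Arrays.asList(11, 12, 13, 14));
        Set<Integer> system = new HashSet<>(Arrays.asList(11, 12, 15));
        System.out.println(accuracy(manual, system)); // prints 40.0
    }
}
```

Because the denominator counts every email that appears in either cluster, the score is lowered both by relevant emails that the system missed and by irrelevant emails that it included.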
5. CONCLUSION

The effectiveness of email clustering is tested using the k-means clustering algorithm with the similarity measure SMTP. From the experimental results we have observed that SMTP gives better results for document (email) clustering than the other similarity measures used in the experiment, achieving the highest accuracy in three of the four clusters. This is mainly because SMTP gives more importance to the presence or absence of a feature than to the difference between the two values associated with a present feature. We therefore conclude that, among the measures tested, the k-means clustering algorithm implemented with SMTP gives the best results for email (document) clustering.

In closing, it would be interesting to test how SMTP works with other clustering algorithms, such as the fuzzy c-means clustering algorithm. It would also be worthwhile to analyse SMTP with classification algorithms such as KNN.

ACKNOWLEDGEMENT

The authors wish to thank everyone who has contributed to this work directly or indirectly.

REFERENCES

[1] T. Joachims, (1997) “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization”, in Proc. 14th Int. Conf. Mach. Learn., San Francisco, CA, USA, pp. 143-151.
[2] H. Kim, P. Howland & H. Park, (2005) “Dimension reduction in text classification with support vector machines”, J. Mach. Learn. Res., Vol. 6, pp. 37-53.
[3] G. Salton & M. J. McGill, (1983) Introduction to Modern Information Retrieval. London, U.K.: McGraw-Hill.
[4] J. Han & M. Kamber, (2006) Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Elsevier.
[5] T. W. Schoenharl & G. Madey, (2008) “Evaluation of measurement techniques for the validation of agent-based simulations against streaming data”, in Proc. ICCS, Krakow, Poland.
[6] C. G. Gonzalez, W. Bonventi, Jr. & A. L. V. Rodrigues, (2008) “Density of closed balls in real-valued and autometrized Boolean spaces for clustering applications”, in Proc. 19th Brazilian Symp. Artif. Intell., Salvador, Brazil, pp. 8-22.
[7] J. A. Aslam & M. Frost, (2003) “An information-theoretic measure for document similarity”, in Proc. 26th SIGIR, Toronto, ON, Canada, pp. 449-450.
[8] D. Lin, (1998) “An information-theoretic definition of similarity”, in Proc. 15th Int. Conf. Mach. Learn., San Francisco, CA, USA.
[9] J. D’hondt, J. Vertommen, P. A. Verhaegen, D. Cattrysse & R. J. Duflou, (2010) “Pairwise-adaptive dissimilarity measure for document clustering”, Inf. Sci., Vol. 180, No. 12, pp. 2341-2358.
[10] R. W. Hamming, (1950) “Error detecting and error correcting codes”, Bell Syst. Tech. J., Vol. 29, No. 2, pp. 147-160.
[11] S. Kullback & R. A. Leibler, (1951) “On information and sufficiency”, Ann. Math. Statist., Vol. 22, No. 1, pp. 79-86.
[12] G. H. Ball & D. J. Hall, (1967) “A clustering technique for summarizing multivariate data”, Behav. Sci., Vol. 12, No. 2, pp. 153-155.
[13] R. O. Duda, P. E. Hart & D. J. Stork, (2001) Pattern Recognition, New York, NY, USA: Wiley.
[14] Yung-Shen Lin, Jung-Yi Jiang & Shie-Jue Lee, (2014) “A Similarity Measure for Text Classification and Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 7.
[15] M. B. Eisen, P. T. Spellman, P. O. Brown & D. Botstein, (1998) “Cluster analysis and display of genome-wide expression patterns”, Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14863-14868.
[16] https://www.cs.cmu.edu/~./enron/
Authors

Ms. Deepa B. Patil is a postgraduate student in Computer Engineering at Savitribai Phule Pune University, Pune, Maharashtra State, India.

Prof. Yashwant V. Dongre is an Assistant Professor at Vishwakarma Institute of Information Technology (VIIT) at Savitribai Phule Pune University, Pune, Maharashtra State, India. His areas of interest include database management, data mining and information retrieval. He has several papers to his credit published in prestigious journals, including AIRCCE IJDMS.