Cosdes a collaborative spam detection system with a novel e mail abstraction scheme
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 669 Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme Chi-Yao Tseng, Pin-Chieh Sung, and Ming-Syan Chen, Fellow, IEEE Abstract—E-mail communication is indispensable nowadays, but the e-mail spam problem continues growing drastically. In recent years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spams. On purpose of achieving efficient similarity matching and reducing storage utilization, prior works mainly represent each e-mail by a succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot fully catch the evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this paper, we propose a novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can more effectively capture the near-duplicate phenomenon of spams. Moreover, we design a complete spam detection system Cosdes (standing for COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables system Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set collected from a real e-mail server and show that our system outperforms the prior approaches in detection results and is applicable to the real world. Index Terms—Spam detection, e-mail abstraction, near-duplicate matching. Ç1 INTRODUCTIONE -MAILcommunication is prevalent and indispensable nowadays. However, the threat of unsolicited junk e-mails, also known as spams, becomes more and more Although the techniques used by spammers vary constantly, there is still one enduring feature: spams with identical or similar content are sent in large quantities andserious. According to a survey by the website TopTenRE- successively. Since only a small amount of e-mail users willVIEWS , 40 percent of e-mails were considered as spams order products or visit websites advertised in spams,in 2006. The statistics collected by MessageLabs1 show that spammers have no choice but to send a great quantity ofrecently the spam rate is over 70 percent and persistently spams to make profits. It means that even with developingremains high. The primary challenge of spam detection and employing unexpected new tricks, spammers still haveproblem lies in the fact that spammers will always find new to send out large quantities of identical or similar spamsways to attack spam filters owing to the economic benefits of simultaneously and in succession. This specific feature ofsending spams. Note that existing filters generally perform spams can be designated as the near-duplicate phenomenon,well when dealing with clumsy spams, which have which is a significant key in the spam detection problem.duplicate content with suspicious keywords or are sent In view of above facts, the notion of collaborative spamfrom an identical notorious server. Therefore, the next stage filtering with near-duplicate similarity matching schemeof spam detection research should focus on coping with has recently received much attention. The primary idea ofcunning spams which evolve naturally and continuously. the near-duplicate matching scheme for spam detection is to maintain a known spam database, formed by user feedback, 1. http://www.messagelabs.com/globalthreats. to block subsequent spams with similar content. Collabora- tive filtering indicates that user knowledge of what spam. C.-Y. Tseng and M.-S. Chen are with the Department of Electrical may subsequently appear is collected to detect following Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, spams. Overall, there are three key points of this type of Taipei 10617, Taiwan, and the Research Center for Information Technology spam detection approach we have to be concerned about. Innovation (CITI), Academia Sinica, No. 128, Sec. 2, Academia Road, First, an effective representation of e-mail (i.e., e-mail Nankang, Taipei 11529, Taiwan. E-mail: firstname.lastname@example.org, email@example.com. abstraction) is essential. Since a large set of reported spams. P.-C. Sung is with the Department of Electrical Engineering, National has to be stored in the known spam database, the storage Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan. size of e-mail abstraction should be small. Moreover, the e- E-mail: firstname.lastname@example.org. mail abstraction should capture the near-duplicate phe-Manuscript received 7 Mar. 2009; revised 9 Aug. 2009; accepted 7 Dec. 2009; nomenon of spams, and should avoid accidental deletion ofpublished online 24 Aug. 2010. nonspam e-mails (also known as hams). Second, everyRecommended for acceptance by D. Cook.For information on obtaining reprints of this article, please send e-mail to: incoming e-mail has to be matched with the large database,email@example.com, and reference IEEECS Log Number TKDE-2009-03-0118. meaning that the near-duplicate matching process shouldDigital Object Identifier no. 10.1109/TKDE.2010.147. be substantially efficient. Finally, the latest spams have to be 1041-4347/11/$26.00 ß 2011 IEEE Published by the IEEE Computer Society
670 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011included instantly and successively into the database so as match each incoming e-mail with an existing huge spamto effectively block subsequent near-duplicate spams. database. To resolve this issue, we devise an innovative tree Although previous researchers have developed various structure, SpTrees, to store large amounts of the e-mailmethods on near-duplicate spam detection , , , , abstractions of reported spams, and SpTrees contribute to, , , , , , , these works are still substantially promoting the efficiency of matching. In thesubject to some drawbacks. To achieve the objectives of design of the near-duplicate matching scheme based onsmall storage size and efficient matching, prior works SpTrees, we aim at reducing the number of spams and tagsmainly represent each e-mail by a succinct abstraction which are required to be compared.derived from e-mail content text. Moreover, hash-based text By integrating above techniques, in this paper, we designrepresentation is applied extensively. One major problem of a complete spam detection system COllaborative Spamthese abstractions is that they may be too brief and thus DEtection System (Cosdes). Cosdes possesses an efficientmay not be robust enough to withstand intentional attacks. near-duplicate matching scheme and a progressive updateA common attack to this type of representation is to insert a scheme. The progressive update scheme not only adds inrandom normal paragraph without any suspicious key-words into unobvious position of an e-mail. In such a new reported spams, but also removes obsolete ones in thecontext, if the whole e-mail content is utilized for hash- database. With Cosdes maintaining an up-to-date spambased representation, the near-duplicate part of spams database, the detection result of each incoming e-mail cancannot be captured. In addition, the false positive rate (i.e., be determined by the near-duplicate similarity matchingthe rate of classifying hams as spams) may increase because process. In addition, to withstand intentional attacks, athe random part of e-mail content is also involved in e-mail reputation mechanism is also provided in Cosdes to ensureabstraction. On the other hand, hash-based text representa- the truthfulness of user feedback.tion also suffers from the problem of not being suitable for To the best of our knowledge, there is no prior researchall languages. Finally, images and hyperlinks are important in considering e-mail layout structure to represent e-mailsclues to spam detection, but both of them are unable to be in the field of near-duplicate spam detection. In summary,included in hash-based text representation. the contributions of this paper are as follows: In this paper, we explore to devise a more sophisticated e-mail abstraction, which can more effectively capture the near- 1. We propose the specific procedure SAG to generateduplicate phenomenon of spams. Motivated by the fact that e- the e-mail abstraction using HTML content in e-mail,mail users are capable of easily recognizing similar spams by and this newly devised abstraction can moreobserving the layouts of e-mails, we attempt to represent each effectively capture the near-duplicate phenomenone-mail based on the e-mail layout structure. Fortunately, of spams.almost all e-mails nowadays are in Multipurpose Internet 2. We devise an innovative tree structure, SpTrees, toMail Extensions (MIME) format with the text/html content- store large amounts of the e-mail abstractions oftype. That is, HTML content is available in an e-mail and reported spams. SpTrees contribute to the accom-provides sufficient information about e-mail layout structure. plishment of the efficient near-duplicate matchingIn view of this observation, we propose the specific procedure with a more sophisticated e-mail abstraction. 3. We design a complete spam detection systemStructure Abstraction Generation (SAG), which generates an Cosdes with an efficient near-duplicate matchingHTML tag sequence to represent each e-mail. Different from scheme and a progressive update scheme. Theprevious works, SAG focuses on the e-mail layout structure progressive update scheme enables system Cosdesinstead of detailed content text. In this regard, each paragraph to keep the most up-to-date information for near-of text without any HTML tag embedded will be transformed duplicate detection.to a newly defined tag <mytext=>. The rest of this paper is outlined as follows: In Section 2,Definition 1 (<mytext=>). <mytext=> is a newly defined preliminaries including the definition of near-duplicate and tag that represents a paragraph of text without any HTML tag the related works are given. In Section 3, we introduce the embedded. novel e-mail abstraction scheme. In Section 4, the complete system model of Cosdes is depicted. The experimental Since we ignore the semantics of the text, the proposed results are shown in Section 5, and finally, this paper isabstraction scheme is inherently applicable to e-mails in all concluded with Section 6.languages. This significant feature is superior to mostexisting methods. Once e-mails are represented by ournewly devised e-mail abstractions, two e-mails are viewed 2 PRELIMINARIESas near-duplicate if their HTML tag sequences are exactly In this section, the definition of near-duplicate, in thisidentical to each other. Note that even when spammers insert paper, is presented in Section 2.1. We then review therandom tags into e-mails, the proposed e-mail abstraction related works on spam detection in Section 2.2.scheme will still retain efficacy since arbitrary tag insertion isprone to syntax errors or tag mismatching, meaning that the 2.1 Definition of Near-Duplicateappearance of the e-mail content will be greatly altered. The central idea of near-duplicate spam detection is to exploitMoreover, the proposed procedure SAG also adopts some reported known spams to block subsequent ones which haveheuristics to better guarantee the robustness of our approach. similar content. For different forms of e-mail representation, While a more sophisticated e-mail abstraction is intro- the definitions of similarity between two e-mails are diverse.duced, one challenging issue arises: how to efficiently Unlike most prior works representing e-mails based mainly
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 671on content text, we investigate representing each e-mail general, Naive Bayes methods train a probability modelusing an HTML tag sequence, which depicts the layout using classified e-mails, and each word in e-mails will bestructure of e-mail, and look forward to more effectively given a probability of being a suspicious spam keyword. Ascapturing the near-duplicate phenomenon of spams. Initi- for SVMs, it is a supervised learning method, whichally, the definition of <anchor> tag is given as follows: possesses outstanding performance on text classificationDefinition 2 (<anchor>). The tag <anchor> is one type of tasks. Traditional SVMs  and improved SVMs , , newly defined tag that records the domain name or the e-mail  have been investigated. While above conventional address in an anchor tag. machine learning techniques have reported excellent results with static data sets, one major disadvantage is that it is For example, the anchor tag <a href=“http://arbor.ee. cost-prohibitive for large-scale applications to constantlyntu.edu.tw/index.htm”> is transformed to <arbor.ee.ntu. retrain these methods with the latest information to adapt toedu.tw>. The anchor tag <a href=“mailto:cytseng@arbor. the rapid evolving nature of spams. The spam detection ofee.ntu.edu.tw”> is transformed to <firstname.lastname@example.org. these methods on the e-mail corpus with various languagentu.edu.tw>. The purpose of creating the <anchor> tag is has been less studied yet. In addition, other classificationto minimize the false positive rate when the number of tags techniques, including markov random field model ,in an e-mail abstraction is short. The less the number of tags neural network  and logic regression , and certainin an e-mail abstraction, the more possible that a ham may specific features, such as URLs  and images , be matched with known spams and be misclassified as a have also been taken into account for spam detection.spam. Therefore, when the number of tags in an e-mail The other group attempts to exploit noncontent informa-abstraction is smaller than a predefined threshold, for each tion such as e-mail header, e-mail social network , ,anchor tag <a>, we specifically record the targeted domain and e-mail traffic ,  to filter spams. Collectingname or e-mail address, which is a significant clue for notorious and innocent sender addresses (or IP addresses)identifying spams. from e-mail header to create black list and white list is a On the other hand, in this paper, the detailed definition commonly applied method initially. MailRank  examinesof near-duplicate is given as follows: the feasibility of rating sender addresses with algorithm PageRank in the e-mail social network, and in , aDefinition 3 (Near-Duplicate). Let I ¼ ft1 ; t2 ; . . . ; ti ; . . . ; tn ; modified version with update scheme is introduced. Since <mytext=>; <anchor>g be the set of all valid HTML tags e-mail header can be altered by spammers to conceal the with two types of newly created tags, <mytext=> and identity, the main drawback of these methods is the <anchor>, included. An e-mail abstraction derived from hardness of correctly identifying each user. In , , the procedure SAG is denoted as <e1 ; e2 ; . . . ; ei ; . . . ; em >, which authors intend to analyze e-mail traffic flows to detect is an ordered list of tags, where ei 2 I. The definition of near- suspicious machines and abnormal e-mail communication. duplicate is: “Two e-mail abstractions ¼ a1 ; a2 ; . . . ; It is noted that these approaches have to operate in ai ; . . . ; an and
¼ b1 ; b2 ; . . . ; bi ; . . . ; bm are viewed as coordination with other complementary methods to gain near-duplicate if 8ai ¼ bi and n ¼ m.” better results. Moreover, some researchers consider com-Definition 4 (Tag Length). The tag length of an e-mail bining the merits of several techniques , , . Even abstraction is defined as the number of tags in an e-mail though the performance of classifier integration seems abstraction. prominent, there is still no conclusion on what is the best combination. In addition, how to efficiently update the whole included classifiers is another unsolved issue. Note that we strictly define that two e-mail abstractions On the other hand, collaborative spam filtering withare near-duplicate only if they are exactly identical to each near-duplicate similarity matching scheme has been stu-other. The major reason is that there are numerous HTML died extensively in recent years. Regarding collaborativetag patterns appearing commonly and frequently. Partial mechanism, P2P-based architecture , , , centra-matching of HTML tag sequences will cause much higherrate of false positive error, and the complexity will be too lized server-based system , , , , and othershigh to achieve efficient matching. In addition, for further ,  are generally employed. Note that no matterspeed-up, while the tag length of an e-mail abstraction is which mechanism is applied, the most critical factor is howlonger, we even apply a looser matching criterion, which to represent each e-mail for near-duplicate matching. The e-does not degrade detection results. mail abstraction not only should capture the near-duplicate phenomenon of spams, but should avoid accidental2.2 Related Works deletion of hams. In , the first N hash values of eachSince the e-mail spam problem is increasingly serious length L substring are used as vector representation of thenowadays, various techniques have been explored to relieve e-mail. In , , , a 32-byte code derived from athe problem. Based on what features of e-mails are being variation of Nilsimsa digest technique is utilized toused, previous works on spam detection can be generally represent the distribution of word trigrams in e-mail. Inclassified into three categories: 1) content-based methods, , , , the authors improve the open digest2) noncontent-based methods, and 3) others. Initially, technique  by representing each e-mail with multipleresearchers analyze e-mail content text and model this digests produced from the strings of fixed length sampledproblem as a binary text classification task. Representatives at randomized positions within e-mail. In , , aof this category are Naive Bayes ,  and Support feature vector of a block text fingerprint generated from theVector Machines (SVMs) , , ,  methods. In set of checksums of each length L substring is exploited. In
672 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 Fig. 2. An example of the preprocessing step in Tag Extraction Phase of procedure SAG. to form the tentative e-mail abstraction. Note that we retain only the first 1,023 tags as the tag sequence. The main reason is that the rear part of long e-mails can be ignored without affecting the effectiveness of near-duplicate matching. Sub- sequently, in line 6 of Fig. 1, we preprocess the tag sequence of the tentative e-mail abstraction. One objective of this preprocessing step is to remove tags that are common butFig. 1. Algorithmic form of procedure SAG. not discriminative between e-mails. The other objective is to prevent malicious tag insertion attack, and thus the robust-, the authors make use of spam-vocabulary patterns ness of the proposed abstraction scheme can be furtherproduced by Teiresias pattern discovery algorithm. In , enhanced. The following sequence of operations is performedthe I-Match signature determined by a set of unique terms in the preprocessing step.shared by spams and the I-Match lexicon is put to use. In, the content similarity of e-mails computed using 1. Front and rear tags (as shown in the gray area of theextracted words is measured. It is noted that most existing example e-mail in the top of Fig. 3) are excluded.methods generate e-mail abstractions based mainly on 2. Nonempty tags2 that have no corresponding startcontent text. However, randomized and normal paragraphs tags or end tags are deleted. Besides, mismatchedare commonly inserted in spams nowadays, and thus if an nonempty tags are also deleted.e-mail abstraction is generated by the whole content text, 3. All empty tags2 are regarded as the same and arethe near-duplicate part of spams cannot be captured. replaced by the newly created empty= tag.Moreover, generating e-mail abstraction with the content Moreover, successive empty= tags are prunedtext also suffers from the problem of not being applicable to and only one empty= tag is retained.all languages. 4. The pairs of nonempty tags enclosing nothing are In light of the above problems, it deserves further studies removed.to design a better e-mail abstraction approach that is morerobust to withstand intentional attacks. Example 1. Consider the example e-mail abstraction in Fig. 2 that has been produced through the execution of lines 1-53 E-MAIL ABSTRACTION SCHEME in procedure SAG. The first operation of the preprocessing step is to remove tags which are in front of the body tagIn this section, a novel e-mail abstraction scheme is and which are in rear of the =body tag. With regard tointroduced. In Section 3.1, procedure SAG is presented to operation 2, since there is no end tag of a, this tag isdepict the generation process of an e-mail abstraction. The deleted. Besides, the tags font and =font are alsodevised data structures SpTable and SpTrees are illustrated deleted because the position of =font is incorrect. Notein Section 3.2. Finally, the robustness issue is discussed in that we can utilize the stack data structure to determineSection 3.3. whether nonempty tags are mismatched. After that, empty3.1 Structure Abstraction Generation tags are transformed to empty= in operation 3. More-We propose the specific procedure SAG to generate the e-mail over, since mytext=img=hr= appear consecu-abstraction using HTML content in e-mail. SAG is elaborated tively, only one empty= tag is retained. Finally,with the example of Fig. 3, and the algorithmic form of SAG is divfont=font=div are removed due to theoutlined in Fig. 1. Procedure SAG is composed of three major lack of content.phases, Tag Extraction Phase, Tag Reordering Phase, andanchor Appending Phase. In Tag Extraction Phase, thename of each HTML tag is extracted, and tag attributes and 2. In the HTML Document Type Definition (DTD), empty tags contain noattribute values are eliminated. In addition, each paragraph text and have no end tags. For example, br=, hr=, img=, input=of text without any tag embedded is transformed to are empty tags. The newly created tag, mytext=, is also regarded as an empty tag in this paper. On the other hand, all tags that are not declared inmytext=. In lines 4-5, anchor tags are then inserted the DTD as empty tags are nonempty tags. Nonempty tags must have aninto AnchorSet, and the first 1,023 valid tags are concatenated end tag.
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 673 Fig. 4. The data structures of SpTable and SpTrees. appended in front of the e-mail abstraction. The main objective of appending anchor tags is to reduce the probability that a ham is successfully matched with reported spams when the tag length of an e-mail abstraction is short.Fig. 3. An example procedure flow of SAG. An example e-mail abstraction produced by procedure SAG is shown in the bottom of Fig. 3. The middle part of Fig. 3 shows an example of a tentative 3.2 Design of SpTable and SpTreese-mail abstraction and AnchorSet (i.e., spam:com) One major focus of this work is to design the innovativederived from Tag Extraction Process. data structure to facilitate the process of near-duplicate On purpose of accelerating the near-duplicate matching matching. SpTable and SpTrees (sp stands for spam) areprocess, we reorder the tag sequence of an e-mail abstrac- proposed to store large amounts of the e-mail abstractionstion in Tag Reordering Phase. Note that since the arrange- of reported spams. As shown in Fig. 4, several SpTrees arement of HTML tags is regular and in pairs, various the kernel of the database, and the e-mail abstractions ofsequential patterns of tags are contained in e-mails. In the collected spams are maintained in the correspondingworst case, if we consider two e-mail abstractions which SpTrees. According to Definition 3, two e-mail abstractionshave the same tag length and differ only in their last tags, are possible to be near-duplicate only when the numbers ofthe difference cannot be detected until the last tags are their tags are identical. Thus, if we distribute e-mailcompared. To handle this problem, we destroy the abstractions with different tag lengths into diverse SpTrees,regularity by rearranging the order of tag sequence to the quantity of spams required to be matched will decrease.lower the number of tag comparisons. Note that this process However, if each SpTree is only mapped to one single tagensures that the newly assigned position numbers of e-mail length, it is too much of a burden for a server to maintainabstractions with the same number of tags are completely such thousands of SpTrees. In view of this concern, eachidentical. As such, the matching process can be accelerated SpTree is designed to take charge of e-mail abstractionswithout violating the definition of near-duplicate in this within a range of tag lengths. As can be seen in Fig. 4,paper. In lines 8-11 of Fig. 1, each tag is assigned a new SpTable is created to record overall information of SpTrees.position number by function ASSIGN_PN (PN denotes for The ith column of SpTable links to the root of SpTree_i by aPosition Number) with following expressions, pointer, and e-mail abstractions with tag lengths ranging pﬃﬃﬃﬃ from 2i to 2iþ1 À 1 belong to SpTree_i. b ¼ d Le; Regarding how an e-mail abstraction is stored in SpTree, r ¼ ðP Norig À 1Þ%b; Fig. 5 gives an example with the same e-mail abstraction q ¼ bðP Norig À 1Þ=bc þ 1; derived from Fig. 3. An e-mail abstraction is segmented into P Nnew ¼ ðb Â rÞ þ ðb À q þ 1Þ; several subsequences, and these subsequences are consecu- tively put into the corresponding nodes from low levels towhere L is the tag length of an e-mail abstraction, and P Norig high levels. As such, an e-mail abstraction is stored in oneis the original position number. Variable b is the number of path from the root node to a leaf node of SpTree, and hencebuckets. Variable r indicates which bucket should be placed, the matching between a testing e-mail and known spams isand variable q is the number of shift counts from the end of processed from root to leaf. As shown in Fig. 5, the examplethis bucket. Fig. 3 demonstrates the assignment of the first e-mail abstraction is stored in a path from the root node a tosix tags. The final e-mail abstraction is the concatenation of the leaf node d. The primary goal of applying the tree dataall tags with new position numbers (the vacant positions, structure for storage is to reduce the number of tagse.g., positions 9 and 13 in Fig. 3, are ignored). Additionally, if required to be matched when processing from root to leaf.the tag length of an e-mail abstraction is smaller than a Since only subsequences along the matching path from rootpredefined tag length threshold (set as 16 in the experi- to leaf should be compared, the matching efficiency isments) of the short e-mail, the tags in AnchorSet will be substantially increased. Note that if each type of HTML tag
674 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 With the hash function, most subsequence matching is transformed to the integer matching, and hence the complex- ity of matching process can be substantially reduced. Overall, the advantageous features of this innovative arrangement are as follows: 1) The height of an SpTree is equal to blg Lc, where L is the tag length of the longest e- mail abstraction in this SpTree. 2) Owing to the fact that parent nodes store less number of subsequences than children nodes, we design that longer subsequences are put into higher levels (the tag length of the subsequence on level i is 2i ). Thus, the number of tags matched from root to leaf is markedly decreased. Moreover, with the hash function, the matching efficiency is substantiallyFig. 5. The illustration of SpTree_3 with an example e-mail abstraction. increased. 3) The numbers of tags stored in the nodes of an SpTree are expected to be similar, and hence SpTreesdetermines a branch direction (i.e., the tree degree will be are balanced binary trees. Assume that there are N e-mailthe number of HTML tag types) and each level of SpTree abstractions in SpTree_i and subsequences on the samecontains merely one HTML tag (i.e., the tree height will be level are uniformly distributed. For each node on level j,the tag length of the longest e-mail abstraction), the number there are N subsequences (the number of nodes on level j 2jof tag comparisons will be minimum. However, it is is 2j ) whose tag lengths are equal to 2j . Thus, the numberinfeasible because the degrees and the heights of SpTrees of tags stored in each node is N Â 2j ¼ N, which is not 2jwill be too large, and SpTrees will be extremely unbalanced. correlated with j. It is noted that the number of tags in To achieve efficient matching with balanced tree struc- each leaf node is less than N because not all subse-ture, SpTrees are designed to be binary trees. The branch quences in leaf nodes are with the longest allowed tagdirection of each SpTree is determined by a binary hash length.function. If the first tag of a subsequence is a start tag (e.g., On the other hand, as shown in the bottom of Fig. 5,div), this subsequence will be placed into the left child certain additional information is required and kept in eachnode. A subsequence whose first tag is an end tag (e.g., subsequence of a node. For subsequences in internal nodes,=div) will be placed into the right child node. Since most spam id and timestamp are included. For subsequences inHTML tags are in pairs and the proposed e-mail abstraction external nodes, spam id, user id, timestamp, length EA (theis reordered in procedure SAG, subsequences are expected length of the e-mail abstraction), and SR (the reputationto be uniformly distributed. Moreover, on level i of each score) are involved.SpTree (with the root on level 0), each node storessubsequences whose tag lengths are equal to 2i . For instance, 3.3 Robustness Issueas shown in Fig. 3, the subsequence spam:com (whose The main difficulty of near-duplicate spam detection is totag length is 20 ) is placed into level 0, the subsequence withstand malicious attack by spammers. Prior approaches=pa (whose tag length is 21 ) is placed into level 1, and generate e-mail abstractions based mainly on hash-basedso forth. Note that since SpTree_i takes charge of e-mail content text. These methods primarily differ in whatabstractions with tag lengths ranging from 2i to 2iþ1 À 1, granularity is used as the input of the hash function. Forbased on the above-mentioned arrangement, the last example, the authors in , ,  extract words orsubsequence of each e-mail abstraction in an SpTree will terms to generate the e-mail abstraction. Besides, substringsbe stored in the leaf nodes on the same level. Also note that extracted by various techniques are widely employed in ,the tag lengths of subsequences stored in leaf nodes of level j , , , , , , . However, this type of e-range from 1 to 2j . As described in Section 3.1, we design mail representation inherently has following disadvantages.that the longest length of an e-mail abstraction is 1,023, First, the insertion of a randomized and normal paragraphmeaning that there are totally 10 SpTrees (from SpTree_0 to can easily defeat this type of spam filters. Moreover, sinceSpTree_9) in our database while the proposed arrangement the structures and features of different languages areis applied. In addition, to further accelerate the process ofmatching, we employ a hash function to map each diverse, word and substring extraction may not be applic-subsequence to an integer. The key idea is that only able to e-mails in all languages. Concretely speaking, forsubsequences which look like the testing subsequence instance, trigrams of substrings used in , ,  are notshould be exactly matched. The hash function is defined suitable for nonalphabetic languages, such as Chinese.as follows: In this paper, we devise a novel e-mail abstraction scheme that considers e-mail layout structure to represent e-mails. hashðseqÞ ¼ fðseq½0Þ Ã 2mÀ1 þ fðseq½1Þ Ã 2mÀ2 To assess the robustness of the proposed scheme, we model þ Á Á Á þ fðseq½m À 1Þ Ã 20 ; possible spammer attacks and organize these attacks as following three categories. Examples and the outputs ofwhere m is the number of tags in this subsequence and seq½n preprocessing of procedure SAG are shown in Fig. 6.denotes the tag type of the nth tag. The function f convertseach type of tag to a unique integer. Moreover, for the 3.3.1 Random Paragraph Insertionsubsequence which contains more than eight tags, we just This type of spammer attack is commonly used nowa-use the first eight tags to generate the hash value (i.e., m 8). days. As shown in Fig. 6, normal contents without any
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 675 Fig. 7. System model of Cosdes. Cosdes to perform more robustly. We shall assess theFig. 6. Examples of possible spammer attacks. effectiveness of our approach with real e-mail streams in Section 5.2.advertisement keywords are inserted to confuse text-based spam filtering techniques. It is noted that our 4 COLLABORATIVE SPAM DETECTION SYSTEMscheme transforms each paragraph into a newly created COSDEStag mytext=, and consecutive empty tags will then betransformed to empty=. As such, the representation of A complete collaborative spam detection system Cosdes iseach random inserted paragraph is identical, and thus our introduced in this section. The system model of Cosdes isscheme is resistant to this type of attack. given in Section 4.1. We then elaborate the processing handlers of Cosdes in Section 4.2. Finally, we describe the3.3.2 Random HTML Tag Insertion reputation mechanism of Cosdes in Section 4.3.If spammers know that the proposed scheme is based on 4.1 System Model of CosdesHTML tag sequences, random HTML tags will be inserted The system model of Cosdes is illustrated in Fig. 7, and therather than random paragraphs. On the one hand, arbitrary algorithmic form is outlined in Fig. 8. Initially, threetag insertion will cause syntax errors due to tag mismatch-ing. This may lead to abnormal display of spam content that parameters, Tm (the maximum time span for reported spamsspammers do not wish this to happen. On the other hand, being retained in the system), Td (the time span for triggeringprocedure SAG also adopts some heuristics (as depicted in Deletion Handler), and Sth (the score threshold for deter-Section 3.1) to deal with the random insertion of empty tags mining spams) should be given for Cosdes. Before startingand the tag mismatching of nonempty tags. Fig. 6 shows to do the spam detection, Cosdes collects feedback spams fortwo example outputs and the details of each step can be time Tm in advance to construct an initial database. Threefound in Fig. 2. With the proposed method, most random major modules, Abstraction Generation Module, Databaseinserted tags will be removed, and thus the effectiveness of Maintenance Module, and Spam Detection Module, arethe attack of random tag insertion is limited. We shall verifythis inference in Section 220.127.116.11.3 Sophisticated HTML Tag InsertionSuppose that spammers are more sophisticated, they mayinsert legal HTML tag patterns. As shown in Fig. 6, if tagpatterns that do conform to syntax rules are inserted, theywill not be eliminated. However, although some craftytricks may be conceivable, it is not intuitive for spammers togenerate a large number of spams with completely distincte-mail layout structure. Note that due to space limitation, we are not able todiscuss all possible situations. Nevertheless, representing e-mails with layout structure is more robust to most existingattacks than text-based approaches. Even though new attackhas been designed, we can react against it by adjusting thepreprocessing step of procedure SAG. On the other hand,our approach extracts only HTML tag sequences andtransforms each paragraph with no tag embedded tomytext=, meaning that the proposed abstraction schemecan be applied to e-mails in all languages without modifyingany components. This important feature also enables system Fig. 8. Algorithmic form of system Cosdes.
676 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011included in Cosdes. With regard to Abstraction GenerationModule, each e-mail is converted to an e-mail abstraction byStructure Abstraction Generator with procedure SAG. Threetypes of action handlers, Deletion Handler, InsertionHandler, and Error Report Handler, are involved inDatabase Maintenance Module. Note that although the term“database” is used, the collection of reported spams can beessentially stored in main memory to facilitate the process ofmatching. In addition, Matching Handler in Spam DetectionModule takes charge of determining results. There are three types of e-mails, reported spam, testinge-mail, and misclassified ham, required to be dealt with byCosdes. When receiving a reported spam, InsertionHandler adds the e-mail abstraction of this spam into thedatabase except that the reputation score of this reporter istoo low. Whenever a new testing e-mail arrives, MatchingHandler performs the near-duplicate detection with col-lected spams to do the judgment. Meanwhile, if a testing e-mail is classified as a spam, this e-mail will be viewed as areported spam and be added into the database. Moreover,Error Report Handler copes with feedback misclassifiedhams and adjusts Cosdes by degrading the reputation ofrelated reporters to prevent malicious attacks. For every Td ,Deletion Handler is triggered to delete obsolete spamswhich exist over time Tm . The main functionalities ofdeleting outdated spams are not only to alleviate theoverhead of the server, but to reduce the risk of accidentaldeletion of hams. Due to the evolving nature of spams, it isinappropriate to utilize old spams to filter current ones.Overall, Cosdes is self-adjusting and retains the most up-to-date spams for near-duplicate detection.4.2 Procedures of System CosdesIn this section, we elucidate each procedure of systemCosdes. Cosdes deals with four circumstances by handlers(the algorithmic forms are shown in Fig. 9), and the detailedprocedure flow will be explained as follows: For InsertionHandler in Fig. 9a, initially, the corresponding SpTree isfound in SpTable according to the tag length of the insertedspam, and nowNode is assigned as the root of this SpTree. Inlines 3-8, we iteratively insert the subsequences of thee-mail abstraction along the path from root to leaf. IfnowNode is an internal node, the subsequence with 2i tagsis inserted into level i, which is illustrated in Fig. 5.Meanwhile, the hash value of this subsequence is com-puted. Then, nowNode is assigned as the correspondingchild node based on the type of the next tag. If the next tagis a start (end) tag, nowNode is assigned as the left (right)child node. Finally, when nowNode is processed to a leafnode, the subsequence with remaining tags is stored. Fig. 9. Algorithmic forms of all processing handlers. (a) Algorithmic form of Insertion Handler. (b) Algorithmic form of Matching Handler. Matching Handler (as shown in Fig. 9b) is the most (c) Algorithmic form of Error Report Handler. (d) Algorithmic form ofsignificant procedure in Cosdes to achieve efficient match- Deletion Handler.ing between every testing e-mail and the known spamdatabase. There are two major phases in the matching at positions 2i without doing tag comparisons. It is certainprocess: Approximate Matching Phase and Exact Matching that a testing e-mail may merely be near-duplicate withPhase. As mentioned in Section 3.2, the tag lengths of e-mail spams which have the same tag length and are in the sameabstractions in an SpTree may not be identical. However,two e-mail abstractions are possible to be near-duplicate path. Therefore, we tentatively record the information ofonly when the numbers of their tags are identical. For this spams, which appear in the targeted leaf node and have thereason, in Approximate Matching Phase, we traverse same tag length, into a candidate set candSet. The maindirectly to the targeted leaf node based on the types of tags objectives of the approximate matching are: 1) to reduce
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 677unnecessary tag comparisons of e-mails with different tag 2. If a reporter submits any feedback spam once more,lengths, and 2) to exclude e-mails which can be determined the reputation score will be incremented by a smallerwithout the exact tag matching. Subsequently, the process incremental score Sincre . The value of Sincre is set asstarts Exact Matching Phase from the root of the SpTree. For Sinitial in the experiments. 10each level, in lines 11-17, the hash values of subsequences 3. If a reporter is charged that his previous feedbackare matched first. Then, we do the exact matching of spam is mistaken, the reputation score will besubsequences only if their hash values are matched. The halved.unmatched information of spams will be deleted from To prevent malicious error reports and to attain a near-zerocandSet. Moreover, we design that the exact tag matching isonly processed to level f level (set as 3 in the experiments), false positive rate, we cautiously increase the reputationnamely, only the first 2f levelþ1 À 1 tags are exactly matched. score but drop it drastically while a false positive error isIt means that the looser matching criterion is applied when issued. On the other hand, when SR of a reporter is smallerthe length of an e-mail abstraction is longer than 2f levelþ1 . than Sinitial , his subsequent feedback spams will not beThis looser criterion substantially promotes the efficiency of added into the database until SR is equal to or larger thanmatching but does not influence detection results owing to Sinitial . Regarding the parameter Sth , we simply use a fixedthe effects of the preceding approximate matching and the small value (set as three in the experiments) instead oftag reordering process of procedure SAG. Finally, if the sum determining the threshold according to the ratio of totalof SR of all candidate spams in candSet exceeds Sth , the users. The reason is that as long as there are certain trustytesting e-mail will be classified as a spam. users reporting the e-mails with the same e-mail abstraction To handle e-mails with only plain text, we maintain a as spams, it is sufficiently reliable to classify the subsequentblacklist of not only sender addresses but also URLs and near-duplicate e-mails as spams.e-mail addresses included in the content of reportedspams. Gathering these feedback information to blocktext/plain spams is very effective since spammers have to 5 PERFORMANCE EVALUATIONlet receivers connect to them so as to make profits. To assess the feasibility of system Cosdes, we conduct severalTherefore, Cosdes still can cope with spams with merely experiments to explore its efficiency and detection results.plain text even though we concentrate on e-mail layout The real spam data sets used in the experiments are from thestructure in this paper. When receiving a misclassified ham, Error Report e-mail servers of Computer Center in National TaiwanHandler (shown in Fig. 9c) first finds the corresponding University, which has over 30,000 students. Since the groundSpTree and does the matching process as the same in truth of real e-mail streams is unavailable, spams areMatching Handler. For the spams matched with the extracted from the well-known existing system, SpamAssas-reported misclassified ham, we reset SR of these spams as sin.3 Concerning hams, we not only include public data sets0 to avoid subsequent misclassification incurred by the (around 4,000 e-mails) provided by SpamAssassin,4 but alsoidentical group of spams. In addition, the reputation scores obtain from volunteers. There are about 60,000 spams per dayof reporters who cause the false positive error are halved to and a set of 7,000 or so hams in the data set. Note thatprevent continuous attacks by specific users. numerous related works have evaluated the proposed Moreover, to delete obsolete spams, for every Td , methods with static databases. However, to access theDeletion Handler (as shown in Fig. 9d) traverses each performance of spam detection system with near-duplicateSpTree in inorder (traverses the left subtree, visit the root, matching scheme, real e-mail streams are more appropriateand then traverses the right subtree) to visit all nodes in than static data sets. Therefore, in this paper, we useSpTrees. For each subsequence, if the existing time exceeds university-scale e-mail streams as the experimental data setsTm , it will be viewed as outdated and be deleted from this to better simulate the e-mail environment. On the other hand,node. As such, all obsolete spams are removed from theknown spam database after Deletion Handler is processed. three representative approaches , ,  of near- duplicate spam detection are employed for comparison.4.3 Reputation Mechanism The authors of ,  also adopt the same e-mailThe principal concept of collaborative spam detection is to representation approach as in  but with different sharingcollect human judgment to block subsequent near-duplicate mechanisms. For ease of presentation, Damiani’s work isspams. To ensure the truthfulness of spam reports and to abbreviated as Digest. Sarafijanovic’s work is abbreviated asprevent malicious attacks, we propose the reputation MultiDigest, and Yoshida’s work is abbreviated as Density. Itmechanism to evaluate the credit of each reporter. The is worth mentioning that Sarafijanovic’s work  improvesfundamental idea of the reputation mechanism is to utilize a Damiani’s one  by representing each e-mail with multiplereputation table to maintain a reputation score SR of each digests produced from the strings of fixed length sampled atreporter according to the previous reliability record. Each randomized positions within e-mail. The processes ofinserted spam is given a suspicion score equal to SR of the generating each digest in Digest and MultiDigest arereporter. In such a context, when doing near-duplicate identical. Although Sarafijanovic’s work claims that usingdetection, if the sum of suspicion scores of matched spamsexceeds a predefined threshold, the testing e-mail will be multiple digests can enhance the robustness of near-duplicateclassified as a spam. The reputation mechanism is described spam detection system, these two works have not beenin detail as follows: 3. http://spamassassin.apache.org. 4. http://spamassassin.apache.org/publiccorpus. We include three files, 1. Each reporter is assigned an initial score Sinitial when that is, 20030228_easy_ham.tar.bz2, 20030228_easy_ham_2.tar.bz2, and he submits a reported spam at the first time. 20030228_hard_ham.tar.bz2.
678 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 performed, Cosdes will be the most efficient. However, sequence preprocessing and tag reordering are also executed in Cosdes, and therefore, the execution time slightly increases. Overall, the cost of generating e-mail abstractions of four approaches is very low. With regard to the space issue, Fig. 10b shows the memory usage of four approaches with the number of e-mails varied in the database. Note that we estimate a hash value or an integer number as one unit of memory usage, which approximates to 2 bytes. Digest represents eachFig. 10. The execution time and the memory usage of generating e-mail e-mail with a 32-byte code, which is equal to 16 units. Asabstractions with the number of e-mails varied. (a) Execution time of e- mentioned above, MultiDigest utilizes multiple 32-bytemail representation. (b) Memory usage of e-mail representation. codes for the representation of each e-mail. In the experi- mental data set, the average number of digests in eachvalidated by real e-mail streams. Besides, Yoshida’s work e-mail is approximately 12, and thus the memory usage ofconsiders maintaining a direct-mapped cache to facilitate the MultiDigest is larger than that of Digest by around 12 times.process of matching. In the experiments of his work , there Regarding Density, N hash values, that is, 100 units, are theare 10 million spams in the database and 10 percent of hash representation of each e-mail. However, since some e-mailsvalues are copied in the cache. However, to fairly compare the are too short to be extracted N hash values, the averagedetection performance, the cache mechanism of Yoshida’s memory usage of each e-mail in Density is smaller thatwork is not included in our experiments, meaning that all 100 units. On the other hand, a sequence of HTML tags isspams in the database are used for detection. We implement the representation of each e-mail in Cosdes . We can replaceCosdes and comparative techniques with C++ language, and each type of tag with a unique integer, and thus each tagthe programs are executed in Windows XP professional can be viewed as 1 unit. It is calculated that the averageplatform with Pentium 4—3GHz CPU and 1GB RAM. The length of the HTML tag sequence produced by procedureprograms of Digest and MultiDigest are implemented with SAG of Cosdes is approximately 35. In addition, as stated insource codes shared by original authors of  and . Section 3.2, additional information is required in SpTableInitially, the efficiency and the space usage of generating e- and SpTrees of Cosdes. We include this memory overheadmail representation are investigated in Section 5.1. We as well. It can be observed in Fig. 10b that the memorycompare and analyze the detection results of four approaches usage of Digest is the least, and MultiDigest uses the largestin Section 5.2. The detailed efficiency analysis is presented in memory space among four approaches. As for Cosdes, theSection 5.3. Finally, Section 5.4 simulates the spammer attack memory usage is larger than that of Digest by two to fourof random HTML tag insertion. times. Although a fairly succinct e-mail abstraction can5.1 E-mail Representation greatly reduce the overhead of near-duplicate matching, we can find in the following section that the effectiveness ofThe processing time of the generation of e-mail abstractions Digest cannot be validated.with the number of e-mails varied is shown in Fig. 10a. Asmentioned in Section 2.2, most prior works on near-duplicate 5.2 Accuracy Evaluationspam detection represent e-mails based mainly on content In this section, we evaluate the detection performance oftext. Cosdes is the first work to attempt to utilize HTML Cosdes and three competitive approaches. The mostcontent, which depicts the layout structure of an e-mail, for important requirement for a spam detection system is therepresentation. Regarding Digest, word trigrams are ex- capability to resist malicious attack that evolves continu-tracted consecutively along the whole e-mail content. As for ously. To examine this capability, two recent streams ofDensity, the authors acquire the first N (in , N is set as spams (collected from National Taiwan University in100) hash values of each length L substring with a fixed-size September 2007 and February 2009) are utilized as thewindow sliding through an e-mail. As can be seen in Fig. 10a, experimental data sets. Regarding the language of contentDensity takes the least time since it computes only the first N text, 80 percent of all e-mails are in Chinese, and 15 percenthash values. As for Digest, hash values of word trigrams in of them are in English. The minority of e-mails are inthe whole e-mail are required to be computed. Regarding Japanese, French, and so forth. Since Chinese is aMultiDigest, each e-mail is represented by a set of digests, nonalphabetic language and English is an alphabetic one,meaning that each e-mail is separated into multiple strings the data set used in the experiments can verify thewith fixed length (in , the length is set as 60 characters) effectiveness of spam detection system with different kindssampled at randomized positions. Although the length of of languages to a certain extent. Before a system starts to doeach string is much shorter, the overall complexity still the near-duplicate spam detection, a set of known spams isincreases since each e-mail has multiple strings needed to be inserted into the system. We consider situations with theprocessed. Fig. 10a shows that the processing time of parameter Tm varied from one to five days. That is, asMultiDigest is about four times longer than that of Digest, shown in the left side of Fig. 11, the detection results areand thus MultiDigest takes the longest time among four produced by inserting spams within Tm days first, and thenapproaches. In procedure SAG of Cosdes, HTML tags are the following one-day spams are tested. Note that eachextracted and each paragraph of text is transformed to the spam is inserted into the database after the process ofnewly created tag mytext=. If only these operations are matching. On the other hand, the entire set of hams is tested
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 679Fig. 11. Performance of detection results.in each situation. True positive rate (i.e., TP, a real spam isclassified as a spam) and false positive rate (i.e., FP, a realham is misclassified as a spam) are listed in Fig. 11. As can be seen in Fig. 12a, Cosdes reports 96.47 percentTP rate and 0.46 percent FP rate on average, which has themost outstanding performance. The TP rate of Digest isextremely high but the FP rate is unacceptable. In order to Fig. 12. The comparison of detection performance and the performanceaccelerate the process of near-duplicate matching, only a 32- of Cosdes with and without sequence preprocessing in procedure SAG. (a) Average detection results. (b) Impact of sequence preprocessing andbyte code is used in Digest to represent each e-mail. anchor-appending. (c) Impact of score threshold (Sth) in Cosdes.Moreover, as defined in , two e-mails are determined asnear-duplicate if more than 182 bits of their 32-byte (i.e., comparative approaches. These factors could partially256 bits) codes have the same value. It is shown in Fig. 12a account for the results shown in Fig. 11.that as the size of spam database is large, the 32-byte code is Concerning Cosdes, we extract more essential informa-not discriminative to clearly distinguish each e-mail, and tion to represent each e-mail, and the newly devised e-mailthus hams are easily mismatched with known spams. As for abstraction can more effectively capture the near-duplicateMultiDigest, although the authors claim in  that using phenomenon of spams with an acceptable FP rate. More-multiple digests to represent each e-mail can be more robust over, since we represent each e-mail using HTML tagagainst increased obfuscation effort by spammers, the FP sequence rather than content text, Cosdes can intrinsicallyrate of MultiDigest is even worse than that of Digest as the apply to both alphabetic and nonalphabetic languagessize of spam database is large. This is owing to the reason without modifying any components of the system. Thisthat MultiDigest separates each e-mail into a set of short advantageous property is verified with our data set thatstrings. As long as one digest in the huge spam database is consists of 15 percent English e-mails and 80 percentsimilar to one of digests in the testing e-mail, this e-mail will Chinese ones. In addition, to further investigate thebe classified as a spam. In  and , there are only components of Cosdes, we evaluate the detection perfor-2,500 spams and 2,500 hams in the data set, which might not mance when either the sequence preprocessing step or thesuffice to reflect the real situation. In addition, the effective- anchor-appending step of procedure SAG is removed. It canness of Digest and MultiDigest has not been validated by be observed in Fig. 12b that the performance almost doesreal e-mail streams. On the other hand, the effectiveness of not degrade as we exclude the sequence preprocessing step.Density has been evaluated in  with 10 million spams in This consequence reveals that the proposed abstractionthe database. One problem of Density is that a huge number scheme has not been countered. Nevertheless, we proposeof known spams are required to make the proposed cache elaborate design to guarantee the robustness of Cosdes. Onmechanism of Density work well. However, in our experi- the other hand, as depicted in Section 3.1, the main objectivements, even though we do not consider the cache mechan- of appending anchor tags in front of the e-mailism of Density and all reported spams are used for near- abstraction is to reduce the probability that a ham isduplicate spam detection, the effectiveness of Density on successfully matched with reported spams when the tagmore recent e-mail streams cannot be validated. Moreover, length of an e-mail abstraction is short. Anchor-appendingseveral parameters should be given for Density and be process will be included as the length of an e-mailadjusted according to different environments. Since the abstraction is smaller than 16. It can be seen in the rightauthors in  do not provide the parameter tuning method, side of Fig. 12b that the TP rate increases slightly but the FPin this experiment, we follow the same setting as in  and rate is twice higher than that of the original situation as thisobtain the results in Fig. 11. Note that various tricks targeting process is removed. This is because there are several hamsat nullifying the approaches of hash-based text representa- containing only a URL that normal users want to share withtion have been increasingly employed recently. Besides, their friends. If the anchor-appending process is removed,most spams in our data set are in Chinese, which is a these e-mails will be misclassified as spams, and thereby thenonalphabetic language. However, Digest and MultiDigest FP rate deteriorates.generate hash values with trigrams of substrings. The Subsequently, we explore the impact of score thresholdmethod of this kind is essentially designed to resist the that determines the number of matched spams that Cosdesword obfuscation of nonalphabetic languages. Therefore, the will detect a testing e-mail as a spam. Fig. 12c shows the TPresults in this section also verify the language restriction of rate and the FP rate of Cosdes with score threshold Sth
680 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011Fig. 13. The comparison of matching efficiency and the efficiency of Fig. 14. Performance of Insertion Handler and Deletion Handler inCosdes with and without the hash function in the matching process. Cosdes. (a) Execution time of insertion of Cosdes. (b) Execution time of(a) Execution time of matching of 4 approaches. (b) Impact of hash in deletion of Cosdes.Cosdes devised data structure and the customized matchingvaried from 1 to 5. It can be observed that as Sth increases scheme, the execution time is still controlled in two to fourbut remains small, the detection results of Cosdes are still times of Digest. Moreover, regarding the effectiveness of theconsistent. This means that no complex heuristic method is proposed hash function in the matching process, Fig. 13brequired to determine the appropriate value of Sth . More- shows that a speedup of approximately three times isover, this threshold is not required to be adjusted according achieved. According to above results, it is shown that withto the size of the data set. Note that as the FP rate increases the design of a more sophisticated e-mail abstraction,to a certain unacceptable value, our system can simply Cosdes gains better near-duplicate detection results withresponse by slightly decreasing the value of Sth . The still efficient matching time.property of simple threshold setting is also an advanta- Subsequently, we conduct the efficiency investigation ofgeous feature of Cosdes. Cosdes on inserting e-mail abstractions into the database5.3 Efficiency Analysis and deleting outdated spams from the database. Owing toIn the succeeding experiments, we initially examine the the fact that competitive approaches, Digest and Density,efficiency of near-duplicate matching. Owing to the fact that did not isolate insertion parts from the systems and did noteach incoming e-mail has to be matched with a huge spam take account of deletion, we only study the performance ofdatabase, the efficiency of near-duplicate matching is Insertion Handler and Deletion Handler in Cosdes. Fig. 14acrucial to a collaborative spam detection system. On shows the execution time of Insertion Handler of Cosdesevaluation of matching performance, we consider the with the number of e-mails varied. The execution timesituation of matching with the number of e-mails varied grows linearly and costs merely 3.5 seconds for insertingwhile there are identical e-mails in the database. As shown 100,000 spams into the database. Moreover, as can be seenin Fig. 13a, the execution time of Digest is minimal since in Fig. 10a, the process of generating 100,000 e-mailonly two 32-byte codes of each pair of e-mails have to be abstractions costs about 10 seconds. On the other hand,compared. However, as shown in Fig. 12a, the detection the performance of Deletion Handler is shown in Fig. 14b.results of Digest are not satisfied even if their matching We evaluate the execution time of deleting spams in oneprocesses are very fast. As for MultiDigest, it is defined in day while the number of e-mails in the database varied. The that the similarity measure between two e-mails is the main purpose of this experiment is to examine whether themaximum number of equal bits over all pairs of digests. efficiency of deletion will be influenced by the amount of e-According to this definition, processing all pairs of digests mails stored in SpTrees. It is shown that the deletionbetween two e-mails requires n2 comparisons, where n is process costs only 2 to 3 seconds in each situation, and thethe average number of digests in an e-mail. It is calculated execution time slightly increases with the amount ofthat n is close to 12 in our data set, meaning that the e-mails. Therefore, we can observe that both the processesmatching time of MultiDigest is larger than that of Digest by of insertion and deletion in Cosdes are efficient and incurover 100 times. This indicates that the matching process of very little overhead.MultiDigest is the least efficient among four approaches. To further evaluate the proposed e-mail abstractionRegarding Density, the sequences of the first 100 hash scheme, we consider the sequence preprocessing step andvalues, which are longer than Digest and Cosdes, are the reordering step of procedure SAG. The primarymatched, and therefore Density takes more time than Digest objective of the sequence preprocessing is to preventand Cosdes. It is noted that the authors in  propose a malicious tag insertion attack, and thus the robustness ofcache mechanism to avoid matching each e-mail with the Cosdes can be enhanced. Fig. 15a shows that generatinghuge database. The cache mechanism enables Density to e-mail abstractions with the sequence preprocessing stepmarkedly enhance the efficiency of matching. However, this leads to little increment of execution time. Although thewill degrade the detection performance while the spam detection results in Fig. 12b can be inferred that spammersdatabase is not as huge as in . To fairly compare the still do not intend to obfuscate HTML content, thisperformance, we ignore the cache mechanism of Density in protection process enables Cosdes to perform morethis experiment. On the other hand, Cosdes has to match a robustly in the future. On the other hand, the main purposelonger sequence than Digest, and in essence Cosdes of the reordering step is to differentiate e-mails with similarrequires more time for matching. Nevertheless, with the tag sequences in the earlier stage of matching. As shown in
TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 681Fig. 15. Efficiency analysis on the sequence preprocessing and the Fig. 16. Effectiveness and efficiency analysis of the sequencereordering steps of Cosdes. (a) Impact of sequence preprocessing in preprocessing step under the spammer attack of random HTML tagCosdes. (b) Impact of reordering in Cosdes. insertion. (a) Probability of matching. (b) Execution time of the sequence preprocessing step.Fig. 15b, 10 percent of efficiency improvement is achieved.The improvement is limited since we map each subse- matching and to progressively update the known spamquence in a node of an SpTree to a hash value. Therefore, database. Consequently, the most up-to-date informationthe subsequences that have some prefix tags in common can be invariably kept to block subsequent near-duplicatestill can be differentiated with one comparison. Never- spams. In the experimental results, we show that Cosdestheless, the reordering step certainly can further accelerate significantly outperforms competitive approaches, whichthe matching process. indicates the feasibility of Cosdes in real-world applications.5.4 Spammer AttackTo further verify the robustness of Cosdes, in this section, ACKNOWLEDGMENTSwe simulate the spammer attack of random HTML tag This work was supported in part by the National Scienceinsertion. We consider the situation that a spammer sends Council of Taiwan under Contracts NSC97-2221-E-002-n identical e-mails at a time, where n is varied from 1,000 to 172-MY3.100,000. It is assumed that a sequence of random HTML tagsis inserted into the beginning of each e-mail. The number oftags in a sequence is a random number between 1 and 50.5 REFERENCESRegarding the type of tag, we randomly choose them from  E. Blanzieri and A. Bryl, “Evaluation of the Highest ProbabilityHTML tag list.6 As shown in Fig. 16a, around 90 percent of e- SVM Nearest Neighbor Classifier with Variable Relative Errormails are matched with other e-mails, meaning that only Cost,” Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007.  M.-T. Chang, W.-T. Yih, and C. Meek, “Partitioned Logistic10 percent of spams have completely distinct HTML tag Regression for Spam Filtering,” Proc. 14th ACM SIGKDD Int’l Conf.sequences as random HTML tag insertion is applied. This is Knowledge Discovery and Data mining (KDD), pp. 97-105, 2008.because the sequence preprocessing step of procedure SAG  S. Chhabra, W.S. Yerazunis, and C. Siefkes, “Spam Filtering Usingwill delete nonempty tags that have no corresponding start a Markov Random Field Model with Variable Weighting Schemas,” Proc. Fourth IEEE Int’l Conf. Data Mining (ICDM),tags or end tags. Random HTML tag insertion cannot pp. 347-350, 2004.generate legal tag sequences and thus most tags will be  P.-A. Chirita, J. Diederich, and W. Nejdl, “Mailrank: Usingeliminated. Concerning the efficiency analysis, it can be Ranking for Spam Detection,” Proc. 14th ACM Int’l Conf. Information and Knowledge Management (CIKM), pp. 373-380, 2005.observed in Fig. 16b that the sequence preprocessing step  R. Clayton, “Email Traffic: A Quantitative Snapshot,” Proc. of theincurs very little overhead. This also indicates that if new Fourth Conf. Email and Anti-Spam (CEAS), 2007.spammer attack occurs, Cosdes still has capacity to react  A.C. Cosoi, “A False Positive Safe Neural Network; The Followersagainst it by applying a more sophisticated countermeasure. of the Anatrim Waves,” Proc. MIT Spam Conf., 2008.  E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati, “An Open Digest-Based Technique for Spam Detection,” Proc. Int’l Workshop Security in Parallel and Distributed Systems, pp. 559-564,6 CONCLUSION 2004.In the field of collaborative spam filtering by near-duplicate  E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati, “P2P-Based Collaborative Spam Detection and Filtering,” Proc.detection, a superior e-mail abstraction scheme is required to Fourth IEEE Int’l Conf. Peer-to-Peer Computing, pp. 176-183, 2004.more certainly catch the evolving nature of spams. Com-  P. Desikan and J. Srivastava, “Analyzing Network Traffic topared to the existing methods in prior research, in this paper, Detect E-Mail Spamming Machines,” Proc. ICDM Workshop Privacywe explore a more sophisticated and robust e-mail abstrac- and Security Aspects of Data Mining, pp. 67-76, 2004.  H. Drucker, D. Wu, and V.N. Vapnik, “Support Vectortion scheme, which considers e-mail layout structure to Machines for Spam Categorization,” Proc. IEEE Trans. Neuralrepresent e-mails. The specific procedure SAG is proposed to Networks, pp. 1048-1054, 1999.generate the e-mail abstraction using HTML content in  D. Evett, “Spam Statistics,” http://spam-filter-review.toptene-mail, and this newly-devised abstraction can more reviews.com/spam-statistics.html, 2006.effectively capture the near-duplicate phenomenon of  A. Gray and M. Haahr, “Personalised, Collaborative Spam Filtering,” Proc. First Conf. Email and Anti-Spam (CEAS), 2004.spams. Moreover, a complete spam detection system Cosdes  S. Hershkop and S.J. Stolfo, “Combining Email Models for Falsehas been designed to efficiently process the near-duplicate Positive Reduction,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 98-107, 2005. 5. It is calculated that the average length of the HTML tag sequence  J. Hovold, “Naive Bayes Spam Filtering Using Word-Position-produced by procedure SAG of Cosdes is approximately 35. Based Attributes,” Proc. Second Conf. Email and Anti-Spam (CEAS), 6. http://www.w3schools.com/tags/default.asp. 2005.
682 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 A. Kolcz and J. Alspector, “SVM-Based Filtering of Email Spam Chi-Yao Tseng received the BS degree in with Content-Specific Misclassification Costs,” Proc. ICDM Work- electrical engineering from the National Taiwan shop Text Mining, 2001. University, Taipei, in 2004, and is currently a PhD A. Kolcz, A. Chowdhury, and J. Alspector, “The Impact of Feature candidate in the Graduate Institute of Electrical Selection on Signature-Driven Spam Detection,” Proc. First Conf. Engineering of the National Taiwan University. Email and Anti-Spam (CEAS), 2004. His research interests include sequential pattern J.S. Kong, P.O. Boykin, B.A. Rezaei, N. Sarshar, and V.P. mining, e-mail spam detection, social network Roychowdhury, “Scalable and Reliable Collaborative Spam mining, and multimedia data mining. Filters: Harnessing the Global Social Email Networks,” Proc. Second Conf. Email and Anti-Spam (CEAS), 2005. T.R. Lynam and G.V. Cormack, “On-Line Spam Filter Fusion,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 123-130, 2006. Pin-Chieh Sung received the BS degree in B. Mehta, S. Nangia, M. Gupta, and W. Nejdl, “Detecting Image computer science from the National Tsing Hua Spam Using Visual Features and Near Duplicate Detection,” Proc. University, Hsinchu, in 2006 and the MS degree 17th Int’l Conf. World Wide Web (WWW), pp. 497-506, 2008. in the network database laboratory (led by V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam Filtering Professor Ming-Syan Chen), Graduate Institute with Naive Bayes—Which Naive Bayes?” Proc. Third Conf. Email of Electrical Engineering, National Taiwan Uni- and Anti-Spam (CEAS), 2006. versity, Taipei, in 2008. Currently, he is a M.S. Pera and Y.-K. Ng, “Using Word Similarity to Eradicate Junk software engineer in Cazoodle Inc. His research Emails,” Proc. 16th ACM Int’l Conf. Information and Knowledge interests include web mining, recommendation Management (CIKM), pp. 943-946, 2007. system, and multimedia data mining. I. Rigoutsos and T. Huynh, “Chung-Kwei: A Pattern-Discovery- Based System for the Automatic Identification of Unsolicited E- Mail Messages (SPAM),” Proc. First Conf. Email and Anti-Spam Ming-Syan Chen received the BS degree in (CEAS), 2004. electrical engineering from the National Taiwan S. Sarafijanovic and J.-Y.L. Boudec, “Artificial Immune System for University, Taipei, and the MS and PhD degrees Collaborative Spam Filtering,” Proc. Second Workshop Nature in computer, information and control engineering Inspired Cooperative Strategies for Optimization (NICSO), 2007. from the University of Michigan, Ann Arbor, in S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Improving Digest- 1985 and 1988, respectively. He is now a Based Collaborative Spam Detection,” Proc. MIT Spam Conf., 2008. distinguished research fellow and the director S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Resolving FP-TP of Research Center for Information Technology Conflict in Digest-Based Collaborative Spam Detection by Use of Innovation (CITI) in the Academia Sinica, Tai- Negative Selection Algorithm,” Proc. Fifth Conf. Email and Anti- wan, and is also a distinguished professor jointly Spam (CEAS), 2008. appointed by EE Department, CSIE Department, and Graduate Institute K.M. Schneider, “Brightmail URL Filtering,” Proc. MIT Spam Conf., of Communication Engineering (GICE) at the National Taiwan Uni- 2004. versity. He was a research staff member at IBM Thomas J. Watson D. Sculley and G.M. Wachman, “Relaxed Online SVMs for Spam Research Center, Yorktown Heights, New York, from 1988 to 1996, the Filtering,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and director of GICE from 2003 to 2006, and also the president/CEO of the Development in Information Retrieval (SIGIR), pp. 415-422, 2007. Institute for Information Industry (III), which is one of the largest C.-Y. Tseng, J.-W. Huang, and M.-S. Chen, “Promail: Using organizations for information technology in Taiwan, from 2007 to 2008. Progressive Email Social Network for Spam Detection,” Proc. 10th His research interests include databases, data mining, mobile comput- Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), ing systems, and multimedia networking, and he has published more pp. 833-840, 2007. than 280 papers in his research areas. In addition to serving as program Z. Wang, W. Josephson, Q. Lv, and K.L.M. Charikar, “Filtering chairs/vice chairs, and keynote/tutorial speakers in many international Image Spam with Near-Duplicate Detection,” Proc. Fourth Conf. conferences, he was an associate editor of IEEE TKDE, VLDB Journal, Email and Anti-Spam (CEAS), 2007. KAIS, and also JISE, is currently the editor-in-chief of the International K. Yoshida, F. Adachi, T. Washio, H. Motoda, T. Homma, A. Journal of Electrical Engineering (IJEE), and is a distinguished visitor of Nakashima, H. Fujikawa, and K. Yamazaki, “Density-Based Spam the IEEE Computer Society for Asia-Pacific from 1998 to 2000, and also Detector,” Proc. 10th ACM SIGKDD Int’l Conf. Knowledge Discovery from 2005 to 2007. He is now also serving as the chief executive officer and Data Mining (KDD), pp. 486-493, 2004. of Networked Communication Program, which is a national program F. Zhou, L. Zhuang, B.Y. Zhao, L. Huang, A.D. Joseph, and J.D. coordinating several primary activities in information and communication Kubiatowicz, “Approximate Object Location and Spam Filtering technologies in Taiwan. He holds, or has applied for, 18 US patents and on Peer-to-Peer Systems,” Proc. ACM/IFIP/USENIX Int’l Middle- seven ROC patents in his research areas. He is a recipient of the ware Conf., pp. 1-20, 2003. Academic Award of the Ministry of Education, the National Science Council (NSC) Distinguished Research Award, Pan Wen Yuan Distinguished Research Award, Teco Award, Honorary Medal of Information, and K.-T. Li Research Breakthrough Award for his research work, and also the Outstanding Innovation Award from IBM Corporate for his contribution to a major database product. He also received numerous awards for his research, teaching, inventions, and patent applications. He is a fellow of the ACM and the IEEE. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.