Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement
Slides from my Master's thesis defense. The research was conducted under Prof. Kyu-Young Whang, and the thesis was successfully defended in the Computer Science Dept. at KAIST on 16 December 2010.

Transcript

  • 1. December 16, 2010. Database and Multimedia Lab, Korea Advanced Institute of Science and Technology (KAIST). Improving the Quality of Web Spam Filtering by Using Seed Refinement. Master's Thesis Defense. Presenter: Qureshi, Muhammad Atif. Advisor: Whang, Kyu-Young.
  • 2. Contents: Introduction; Related Work; Web Spam Filtering Using Seed Refinement (Algorithms, Strategy); Performance Evaluation; Conclusion.
  • 3. Web Search Engine. Definition [BP98]: a system that retrieves relevant web pages for users' queries from the World Wide Web (WWW). Examples: Google, Yahoo!, MS Live Search, Naver.
  • 4. Web Page Ranking. Motivation: user queries return a huge number of relevant web pages, but users want to browse only the most important ones. (Relevance means that a web page matches the user's query.) Concept: ordering the relevant web pages according to their importance [GMT04]. (Importance represents the degree of user interest in a relevant web page.) Methods [ACG01]: link-based methods exploit the link structure of the web to order the search results; content-based methods exploit the contents of web pages. We focus on link-based methods since they are prevalent in popular search engines such as Google and Yahoo! [BP98, CDG07, YUT08].
  • 5. Link Structure of the Web [GGP04]. Concept: the web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes; this structure is called the web graph. (A web node represents either a web page or a web domain.) Links are classified into two classes: an inlink is an incoming link to a web node, and an outlink is an outgoing link from a web node. Example (Fig. 1): V = {A, B, C}, E = {AB, BC}; AB is an outlink of node A and an inlink of node B; BC is an outlink of node B and an inlink of node C.
  • 6. Web Page Ranking by Using Link-based Methods. Concept [BP98]: a web node is more important if it receives more inlinks. Popular method: PageRank [BP98], defined by PR[p] = d · Σ_{q→p} PR[q] / N_outlink(q) + (1 − d) · v[p], where PR[p] is the PageRank value of web node p, N_outlink(q) is the number of outlinks of web node q, d is the damping factor (the probability of following an outlink), and v[p] is the probability that a random jump lands on web node p. A minimal sketch follows.
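As a concrete illustration, here is a minimal power-iteration sketch of the formula above. The graph representation and all names (pagerank, outlinks) are ours, not from the thesis, and dangling nodes simply leak score, which a full implementation would redistribute.

```python
# Minimal power-iteration sketch of PageRank over a web graph given as
# outlinks: a dict mapping each web node to the set of nodes it links to.
def pagerank(outlinks, damp=0.85, iters=50):
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    v = {p: 1.0 / len(nodes) for p in nodes}   # uniform random-jump vector v[p]
    pr = dict(v)                               # start from the uniform distribution
    for _ in range(iters):
        nxt = {p: (1.0 - damp) * v[p] for p in nodes}
        for q, targets in outlinks.items():
            if targets:                        # PR[q] is split over q's outlinks
                share = damp * pr[q] / len(targets)
                for p in targets:
                    nxt[p] += share
        pr = nxt
    return pr

# Example on the web graph of Fig. 1: A -> B -> C.
print(pagerank({"A": {"B"}, "B": {"C"}, "C": set()}))
```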
  • 7. Web Spam [HMS02, GG05]. Concept: any deliberate action intended to boost a web node's rank without improving its real merit. Link spam (web spam against link-based methods): an action that changes the link structure of the web in order to boost a web node's ranking. Example (Fig. 2): an actor wants to boost the rank of web node N3, so the actor creates web nodes N3 to Nx. Nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes; nodes N3 to Nx are involved in link spam, so they are called spam nodes.
  • 8. Web Spam Filtering Algorithms. Overview: web spam filtering algorithms output spam nodes to be filtered out [GBG06]. To identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as an input [GGP04, KR06, GBG06, WD05]. A spam input seed set contains spam nodes; a non-spam input seed set contains non-spam nodes. The input seed set serves as the basis for grading the degree to which web nodes are spam or non-spam [GGP04, KR06, GBG06]. Observation: the output quality of a web spam filtering algorithm depends on the quality of its input seed sets, and the output of one algorithm can be used as the input of another. Thus, the algorithms may support one another if placed in an appropriate succession.
  • 9. Motivation and Goal. Motivation: there is no well-known study that addresses the refinement of input seed sets for web spam filtering algorithms, and no well-known study on successions among web spam filtering algorithms. Goal: improve the quality of web spam filtering by using seed refinement and by finding the appropriate succession among web spam filtering algorithms.
  • 10. Contributions. We propose modified algorithms that apply seed refinement techniques, using both spam and non-spam input seed sets, to well-known web spam filtering algorithms. We propose a strategy that finds the best succession of the modified algorithms. We conduct extensive experiments to show the quality improvement of our work: we compare the original (i.e., well-known) algorithms with the respective modified algorithms, and we evaluate the best succession among our modified algorithms.
  • 11. Related Work. There are two research directions related to web spam. (1) Evaluating either the goodness or badness of web nodes [GGP04, KR06]: TrustRank and Anti-TrustRank are well-known algorithms; these two can be used for refining input seed sets. (2) Detecting spam nodes [GBG06, WD05]: Spam Mass and Link Farm Spam are well-known algorithms; these two can be used for identifying web spam. We therefore classify web spam filtering algorithms into two types: seed refinement algorithms (e.g., TrustRank and Anti-TrustRank) and spam detection algorithms (e.g., Spam Mass and Link Farm Spam). Note: existing work exploits a web graph whose web nodes represent domains [GBG06, WD05].
  • 12. TrustRank. Overview [GGP04]: trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains through their outlinks. Trust scores are propagated through the outlinks of trusted domains, and domains whose trust scores are high (≥ a threshold) at the end of the propagation are declared non-spam. Example (Fig. 3): seed non-spam domains 1 and 2 have t(1) = t(2) = 1; domain 3, which gets trust scores from domains 1 and 2, reaches t(3) = 5/6, while domain 4 reaches t(4) = 1/3 (t(i): the trust score of domain i). Observation: trust scores can propagate to spam domains if a trusted domain outlinks to them. A sketch follows.
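TrustRank can be viewed as the PageRank iteration with the random-jump vector biased toward the trusted seeds. Here is a minimal sketch under that reading; the function and variable names are ours, not the thesis code.

```python
# TrustRank sketch: the same propagation as PageRank, but the jump vector v
# concentrates all mass on the trusted seed domains [GGP04].
def trustrank(outlinks, trusted_seeds, damp=0.85, iters=50):
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    v = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
         for p in nodes}
    t = dict(v)                                # start from the seed distribution
    for _ in range(iters):
        nxt = {p: (1.0 - damp) * v[p] for p in nodes}
        for q, targets in outlinks.items():
            if targets:
                share = damp * t[q] / len(targets)
                for p in targets:
                    nxt[p] += share
        t = nxt
    return t                                   # declare non-spam where t >= threshold
```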
  • 13. Anti-TrustRank. Overview [KR06]: anti-trusted domains (e.g., well-known spam domains) are usually pointed to by other spam domains. Anti-trust scores are propagated backward along the inlinks of anti-trusted domains, and domains whose anti-trust scores are high (≥ a threshold) at the end of the propagation are declared spam. Example (Fig. 4): seed spam domains 1 and 2 have at(1) = at(2) = 1; domain 3, which gets anti-trust scores from domains 1 and 2, reaches at(3) = 5/6, while domain 4 reaches at(4) = 1/3 (at(i): the anti-trust score of domain i). Observation: anti-trust scores can propagate to non-spam domains if a non-spam domain outlinks to a spam domain.
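Since Anti-TrustRank propagates scores in the reverse link direction, it can be sketched as TrustRank run on the transposed web graph, reusing the sketch above; this reduction is our structural assumption, not the thesis code.

```python
# Anti-TrustRank sketch: run the TrustRank iteration on the reversed graph,
# seeded with known spam domains, so scores flow backward along inlinks [KR06].
def reverse_graph(outlinks):
    rev = {p: set() for p in outlinks}
    for q, targets in outlinks.items():
        for p in targets:
            rev.setdefault(p, set()).add(q)
    return rev

def anti_trustrank(outlinks, spam_seeds, damp=0.85, iters=50):
    return trustrank(reverse_graph(outlinks), spam_seeds, damp, iters)
```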
  • 14. Spam Mass. Overview [GBG06]: a domain is spam if it has an excessively high spam score. The spam score is estimated by subtracting a non-spam score from the PageRank score, where the non-spam score is the trust score computed by TrustRank. Example (Fig. 5): domain 5 receives many inlinks but only one indirect inlink from a non-spam domain. Observation: since Spam Mass uses TrustRank, it inherits the same problem as TrustRank.
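The [GBG06] decision rule can be sketched as a relative-mass computation over precomputed PageRank and TrustRank scores. The threshold names mirror Table 2; the rest is our illustration, not the thesis code.

```python
# Spam Mass sketch: the spam score of p is PR[p] minus its TrustRank score,
# and the relative mass is that difference as a fraction of PR[p] [GBG06].
# A domain is a spam candidate if its relative mass is high AND its PageRank
# is within the top topPR% of all PageRank scores.
def spam_mass_candidates(pr, trust, relative_mass=0.99, top_pr=1.0):
    ranked = sorted(pr.values(), reverse=True)
    floor = ranked[max(0, int(top_pr * len(ranked)) - 1)]  # topPR% PageRank cutoff
    return {p for p, score in pr.items()
            if score >= floor and score > 0
            and (score - trust.get(p, 0.0)) / score >= relative_mass}
```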
  • 15. Link Farm Spam. Overview [WD05]: a domain is spam if it has many bidirectional links with other domains, or if it has many outlinks pointing to spam domains. Example (Fig. 6): domains 1, 3, and 4 have bidirectional links with the domain being considered. Observation: Link Farm Spam does not take any input seed set, yet a domain can have many bidirectional links with trusted domains as well.
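A rough sketch of the two [WD05] rules, using the limitBL/limitOL thresholds of Table 2; this is a one-pass approximation of ours (the original procedure may iterate rule 2 to a fixpoint).

```python
# Link Farm Spam sketch: rule 1 flags domains with many bidirectional links;
# rule 2 flags domains with many outlinks into already-flagged spam [WD05].
def link_farm_spam(outlinks, limit_bl=7, limit_ol=7):
    spam = set()
    for p, targets in outlinks.items():
        bidir = sum(1 for q in targets if p in outlinks.get(q, set()))
        if bidir >= limit_bl:
            spam.add(p)
    for p, targets in outlinks.items():
        if sum(1 for q in targets if q in spam) >= limit_ol:
            spam.add(p)
    return spam
```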
  • 16. Web Spam Filtering Using Seed Refinement. Objectives: decrease the number of non-spam domains incorrectly detected as spam (called False Positives), and increase the number of spam domains correctly detected as spam (called True Positives). Our approach: we modify the spam filtering algorithms to use both spam and non-spam domains in order to decrease false positives; we use non-spam domains so that their goodness does not propagate to spam domains, and spam domains so that their badness does not propagate to non-spam domains. We then make a succession of these algorithms in order to increase true positives: the seed refinement algorithm is followed by the spam detection algorithm, so that the spam detection algorithm uses the refined input seed sets produced by the seed refinement algorithm.
  • 17. Modified TrustRank. Modification: trust scores should not propagate to known spam domains. Example (Fig. 7): as in Fig. 3, seed non-spam domains 1 and 2 propagate trust (t(3) = 5/6, t(4) = 1/3), but domains 5 and 6 are involved in web spam and would receive t(5), t(6) = 5/12 + …; marking them as seed spam domains blocks the propagation into them.
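Under our reading of the modification, the TrustRank iteration simply never distributes trust into seed spam domains; whether the blocked share is dropped or, as here, split over the remaining outlinks is our assumption.

```python
# Modified TrustRank sketch: identical to trustrank() except that outlinks
# pointing at known (seed) spam domains are excluded from the propagation.
def modified_trustrank(outlinks, trusted_seeds, spam_seeds, damp=0.85, iters=50):
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    v = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
         for p in nodes}
    t = dict(v)
    for _ in range(iters):
        nxt = {p: (1.0 - damp) * v[p] for p in nodes}
        for q, targets in outlinks.items():
            clean = [p for p in targets if p not in spam_seeds]  # block spam seeds
            if clean:
                share = damp * t[q] / len(clean)
                for p in clean:
                    nxt[p] += share
        t = nxt
    return t
```

Modified Anti-TrustRank (next slide) is the symmetric construction: the same blocked propagation on the reversed graph, seeded with spam domains and blocked at seed non-spam domains.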
  • 18. Modified Anti-TrustRank. Modification: anti-trust scores should not propagate to known non-spam domains. Example (Fig. 8): as in Fig. 4, seed spam domains 1 and 2 propagate anti-trust (at(3) = 5/6, at(4) = 1/3), but domains 5, 6, and 7 are non-spam domains and would receive at(5) = 5/12 and at(6), at(7) = 5/12 + …; marking them as seed non-spam domains blocks the propagation into them.
  • 19. Modified Spam Mass. Modification: use Modified TrustRank in place of TrustRank when estimating the non-spam score. Example (Fig. 9): the setting of Fig. 5, now with a seed spam domain marked; domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  • 20. Modified Link Farm Spam. Modification: use two types of input seed sets (spam and non-spam domains); a domain having many bidirectional links with only trusted domains is not detected as a spam domain. Example (Fig. 10): as in Fig. 6, domains 1, 3, and 4 have bidirectional links with the domain being considered, but since seed non-spam domains are among its bidirectional neighbors, those links no longer count as link-farm evidence.
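A sketch of the modified rule, assuming (our reading) that bidirectional links with seed non-spam domains are excluded from the count and that seed non-spam domains themselves are never flagged.

```python
# Modified Link Farm Spam sketch: bidirectional links with trusted (seed
# non-spam) domains no longer count as link-farm evidence.
def modified_link_farm_spam(outlinks, nonspam_seeds, limit_bl=7, limit_ol=7):
    spam = set()
    for p, targets in outlinks.items():
        if p in nonspam_seeds:
            continue
        bidir = sum(1 for q in targets
                    if p in outlinks.get(q, set()) and q not in nonspam_seeds)
        if bidir >= limit_bl:
            spam.add(p)
    for p, targets in outlinks.items():
        if (p not in nonspam_seeds
                and sum(1 for q in targets if q in spam) >= limit_ol):
            spam.add(p)
    return spam
```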
  • 21. Strategy to Make a Succession of Modified Algorithms. Overview: we make a succession in which the seed refinement algorithms (simply, the Seed Refiner) are followed by the spam detection algorithms (simply, the Spam Detector). Manually labeled spam and non-spam domains flow into the Seed Refiner, refined spam and non-spam domains flow into the Spam Detector, and the Spam Detector outputs the detected spam domains (Fig. 11). We also consider the execution order of the algorithms within each class. For the Seed Refiner: Modified TrustRank followed by Modified Anti-TrustRank, or Modified Anti-TrustRank followed by Modified TrustRank. For the Spam Detector: Modified Spam Mass followed by Modified Link Farm Spam, or Modified Link Farm Spam followed by Modified Spam Mass. A sketch of the resulting pipeline follows.
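Reusing the sketches above, the eventual winning succession MATR-MTR-MLFS-MSM could be wired together roughly as follows; the seed-promotion step and all thresholds are our simplifications, not the thesis procedure.

```python
# Succession sketch: Seed Refiner (MATR then MTR) followed by the Spam
# Detector (MLFS then MSM), per Fig. 11.
def matr_mtr_mlfs_msm(outlinks, spam_seeds, nonspam_seeds,
                      cutoff=0.01, relative_mass=0.99):
    # MATR: anti-trust on the reversed graph, blocked at non-spam seeds.
    at = modified_trustrank(reverse_graph(outlinks), spam_seeds, nonspam_seeds)
    refined_spam = spam_seeds | {p for p, s in at.items() if s >= cutoff}
    # MTR: trust propagation, blocked at the refined spam seeds.
    t = modified_trustrank(outlinks, nonspam_seeds, refined_spam)
    refined_nonspam = nonspam_seeds | {p for p, s in t.items() if s >= cutoff}
    # MLFS on the refined non-spam seeds, then MSM on PageRank vs. trust.
    detected = modified_link_farm_spam(outlinks, refined_nonspam)
    detected |= spam_mass_candidates(pagerank(outlinks), t, relative_mass)
    return detected
```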
  • 22. Performance Evaluation. Purpose: show the effect of seed refinement and the effect of succession on the quality of web spam filtering. We conduct two sets of experiments according to these two purposes.
Table 1: Summary of the experiments.
Set 1 (comparisons showing the effect of seed refinement):
- Exp. 1: TR (TrustRank) vs. MTR (Modified TrustRank); cutoff_Tr 0%-300%, Ratio_Top 10%/50%/100%, damp 0.85.
- Exp. 2: ATR (Anti-TrustRank) vs. MATR (Modified Anti-TrustRank); cutoff_ATr 0%-300%, Ratio_Top 10%/50%/100%, damp 0.85.
- Exp. 3: SM (Spam Mass) vs. MSM (Modified Spam Mass); relativeMass 0.7-1.0, topPR 10%/50%/100%, damp 0.85.
- Exp. 4: LFS (Link Farm Spam) vs. MLFS (Modified Link Farm Spam); limitBL 2-7, limitOL 2-7.
Set 2 (comparisons showing the effect of execution order):
- Exp. 5: finding the best succession for the Seed Refiner; cutoff_Tr 50%/75%/100%, cutoff_ATr 100%, damp 0.85.
- Exp. 6: finding the best succession for the Spam Detector; relativeMass 0.8-0.99, topPR 100%, limitBL 7, limitOL 7, damp 0.85.
- Exp. 7: comparison among the best succession, the best known algorithm, and the best modified algorithm; relativeMass 0.8-0.99, topPR 100%, limitBL 7, limitOL 7, damp 0.85.
  • 23. Experimental Parameters.
Table 2: Parameters used in the experiments.
- damp: used in TR, MTR, ATR, and MATR; the probability of following an outlink.
- Ratio_Top: the ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the spam (or non-spam) seed set, we retrieve the domains whose PageRank scores are larger than or equal to the PageRank score of the top-Ratio_Top% domain among all domains, and use those domains as the input seed set.
- cutoff_Tr: the cutoff threshold in TR and MTR for declaring the number of non-spam domains; in this thesis, its value is set proportional to the size of the non-spam input seed set.
- cutoff_ATr: the cutoff threshold in ATR and MATR for declaring the number of spam domains; in this thesis, its value is set proportional to the size of the spam input seed set.
- relativeMass: a threshold used in SM and MSM for deciding whether a domain is spam: if a domain receives an excessively high spam score compared to its non-spam score, it becomes a candidate web spam domain.
- topPR: a threshold used in SM and MSM for restricting spam candidates to domains whose PageRank scores are within the top topPR% of all PageRank scores.
- limitBL: a threshold used in LFS and MLFS for declaring a domain as spam if its number of bidirectional links is equal to or greater than this threshold.
- limitOL: a threshold used in LFS and MLFS for declaring a domain as spam if the number of its outlinks pointing to spam domains is equal to or greater than this threshold.
  • 24. Experimental Data [BCD08, CDB06, CDG07].
Table 3: Characteristics of the data set in terms of domains and web pages. Labeled domains: 1,924 spam and 5,549 non-spam; unlabeled (unknown) domains: 3,929; total: 11,402 domains, covering 77.9 million web pages.
Table 4: Classification of the data set into Seed Set and Test Set. Labeled spam domains: 674 (seed) / 1,250 (test); labeled non-spam domains: 4,948 (seed) / 601 (test).
  • 25. Experimental Measures (Table 5). True positives: the number of domains correctly labeled as belonging to the class (i.e., spam or non-spam) [BCD08]. False positives: the number of domains incorrectly labeled as belonging to the class [BCD08]. (False negatives: the number of domains incorrectly labeled as not belonging to the class.) F-measure: the combined representation of precision and recall [SM86]: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure = 2 · precision · recall / (precision + recall).
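For completeness, a tiny helper computing the three measures from the counts above (names ours).

```python
# Precision, recall, and F-measure from true/false positives and false negatives.
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# e.g., precision 0.86 and recall 0.80 (MLFS-MSM on slide 30) imply
# an F-measure of about 0.83.
```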
  • 26. Comparison between Original and Modified Algorithms (1/3). Experiment 1 (TR vs. MTR): MTR performs either comparably to or slightly better than TR in terms of both true positives and false positives. We find cutoff_Tr effective up to the 100% mark; beyond 100%, detection becomes unstable in terms of false positives. For later experiments, we therefore restrict the cutoff_Tr range to at most 100%. Experiment 2 (ATR vs. MATR): MATR generally performs better than ATR in terms of true positives. We find cutoff_ATr effective up to the 180% mark; beyond that, detection becomes unstable in terms of false positives. For later experiments, we fix cutoff_ATr at 100% to ensure high precision.
  • 27. Comparison between Original and Modified Algorithms (2/3). Experiment 3 (SM vs. MSM): MSM performs slightly better than SM in terms of true positives and comparably in terms of false positives. We find relativeMass effective in the range 0.95 to 0.99 for maximizing true positives while minimizing false positives; for later experiments, we keep 0.8 to 0.99 as the effective range of relativeMass. Experiment 4 (LFS vs. MLFS): MLFS performs better than LFS in terms of false positives, at some expense of true positives. We find limitBL and limitOL most effective at 7 and 7, respectively, for minimizing false positives; for later experiments, we keep limitBL = 7 and limitOL = 7.
  • 28. Comparison between Original and Modified Algorithms (3/3). Summary: all modified algorithms provide better quality than their respective original algorithms. We found SM to be the best original spam detection algorithm among ATR, SM, and LFS due to its high true positives and relatively few false positives; likewise, MSM is the best modified spam detection algorithm among MATR, MSM, and MLFS.
  • 29. The Best Succession for the Seed Refiner (Table 6). The two successions perform identically in terms of true and false positives for finding refined non-spam domains, and in terms of true positives for finding refined spam domains; MATR-MTR performs better than MTR-MATR in the remaining case. Therefore, MATR-MTR is the winner, and we select it as the Seed Refiner.
  • 30. The Best Succession for the Spam Detector (Fig. 12). Comparison: we pick relativeMass = 0.99 since false positives are minimal at this value while true positives are almost comparable across all values of relativeMass. We observe that MLFS alone fails to detect a considerable number of spam domains. The precisions are 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively; the recalls are 0.80, 0.80, 0.33, and 0.76. MLFS-MSM and MSM-MLFS are the best and identical in performance, so we choose MLFS-MSM as the best Spam Detector without loss of generality.
  • 31. Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm (Fig. 13). We pick relativeMass = 0.99 since false positives are minimal at this value while true positives are almost comparable across all values of relativeMass. We observe that MATR-MTR-MLFS-MSM finds more true positives, with somewhat more false positives. The precisions are 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively; the recalls are 0.64, 0.70, and 0.80. Therefore, MATR-MTR-MLFS-MSM is the most effective.
  • 32. Conclusions. We have improved the quality of web spam filtering by using seed refinement: we proposed modifications to four well-known web spam filtering algorithms. We have proposed a strategy of succession of the modified algorithms: the Seed Refiner fixes the execution order of the seed refinement algorithms, and the Spam Detector fixes the execution order of the spam detection algorithms. We have conducted extensive experiments to show the effect of seed refinement on the quality of web spam filtering: every modified algorithm performs better than its respective original algorithm, and the best succession is MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM). This succession outperforms the best original algorithm, SM, by up to 1.25 times in recall and is comparable in terms of precision.
  • 33. References (1/2). [ACG01] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., "Searching the Web," ACM Transactions on Internet Technology (TOIT), Vol. 1, No. 1, pp. 2-43, Aug. 2001. [BP98] Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine," In Proc. 7th Int'l Conf. on World Wide Web (WWW), pp. 107-117, Brisbane, Australia, Apr. 1998. [BCD08] Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., and Leonardi, S., "Link Analysis for Web Spam Detection," ACM Transactions on the Web (TWEB), Vol. 2, No. 1, pp. 1-42, Mar. 2008. [CDB06] Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S., "A Reference Collection for Web Spam," SIGIR Forum, Vol. 40, No. 2, pp. 11-24, Dec. 2006. [CDG07] Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F., "Know Your Neighbors: Web Spam Detection Using the Web Topology," In Proc. 30th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 423-430, Amsterdam, The Netherlands, July 2007. [GG05] Gyongyi, Z. and Garcia-Molina, H., "Web Spam Taxonomy," In Proc. 1st Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 39-47, Chiba, Japan, May 2005. [GBG06] Gyongyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J., "Link Spam Detection Based on Mass Estimation," In Proc. 32nd Int'l Conf. on Very Large Data Bases (VLDB), pp. 439-450, Seoul, Korea, Sept. 2006. [GGP04] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J., "Combating Web Spam with TrustRank," In Proc. 30th Int'l Conf. on Very Large Data Bases (VLDB), pp. 576-587, Toronto, Canada, Aug. 2004.
  • 34. References (2/2). [KR06] Krishnan, V. and Raj, R., "Web Spam Detection with Anti-TrustRank," In Proc. 2nd Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 37-40, Washington, USA, Aug. 2006. [WD05] Wu, B. and Davison, B., "Identifying Link Farm Spam Pages," In Proc. Special Interest Tracks and Posters of the 14th Int'l Conf. on World Wide Web (WWW), pp. 820-829, Chiba, Japan, May 2005. [SM86] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1986. [YUT08] Yoshida, Y., Ueda, T., Tashiro, T., Hirate, Y., and Yamana, H., "What's Going on in Search Engine Rankings," In Proc. 22nd Int'l Conf. on Advanced Information Networking and Applications (AINAW), pp. 1199-1204, Okinawa, Japan, Mar. 2008.
  • 35. THANK YOU VERY MUCH!
  • 36. MTR Algorithm (supplement).
  • 37. MATR Algorithm (supplement).
  • 38. MSM Algorithm (supplement).
  • 39. MLFS Algorithm (supplement).
  • 40. TR vs. MTR (supplement): result plots (a)-(f) for Ratio_Top = 10%, 50%, and 100%.
  • 41. ATR vs. MATR (supplement): result plots (a)-(f) for Ratio_Top = 10%, 50%, and 100%.
  • 42. SM vs. MSM (supplement): result plots (a)-(f) for topPR = 70%, 85%, and 100%.
  • 43. LFS vs. MLFS (supplement): result plots (a) and (b).
  • 44. The Best Succession for the Spam Detector (supplement; comparison figure). MSM performs better than the rest due to its minimal false positives while being almost comparable to the best in terms of true positives. The winner for the Spam Detector is MSM.
  • 45. Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm (supplement). MATR-MTR-MSM performs better than both SM and MSM: it finds more true positives than these two algorithms with comparable false positives. MATR-MTR-MSM is very effective compared to the best known algorithm.
  • 46. Possible Combinations for the Seed Refinement Module (supplement). Succession 1 (MATR-MTR): manually labeled spam and non-spam seed domains enter MATR; the manual non-spam domains and the refined spam domains then enter MTR, which outputs refined spam and non-spam seed domains. Succession 2 (MTR-MATR): manually labeled spam and non-spam seed domains enter MTR; the manual spam domains and the refined non-spam domains then enter MATR, which outputs refined spam and non-spam seed domains.
  • 47. Possible Combinations for the Spam Detection Module (supplement). Combinations: the successions MLFS-MSM and MSM-MLFS, and the single algorithms MLFS and MSM. Succession 1 (MLFS-MSM): refined spam/non-spam seed domains enter MLFS; the detected spam domains and the refined non-spam domains then enter MSM, which outputs the detected spam domains. Succession 2 (MSM-MLFS): refined spam/non-spam seed domains enter MSM; the detected spam domains and the refined non-spam domains then enter MLFS, which outputs the detected spam domains.
  • 48. The TR and ATR Problem (supplement). Left (cf. Fig. 7): trust propagates from seed non-spam domains 1 and 2 (t(1) = t(2) = 1, t(3) = 5/6, t(4) = 1/3) into domains 5 and 6, which are involved in web spam (t(5), t(6) = 5/12 + …). Right (cf. Fig. 8): anti-trust propagates from seed spam domains 1 and 2 (at(1) = at(2) = 1, at(3) = 5/6, at(4) = 1/3) into domains 5, 6, and 7, which are non-spam (at(5) = 5/12, at(6), at(7) = 5/12 + …).
