CSE509: Introduction to Web Science and Technology
Lecture 3: The Structure of the Web, Link Analysis and Web Search
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group, Institute of Business Administration (IBA)
Last Time… Basic Information Retrieval Approaches Bag of Words Assumption Information Retrieval Models Boolean model Vector-space model Topic/Language models July 23, 2011
Today Search Engine Architecture Overview of Web Crawling Web Link Structure Ranking Problem SEO and Web Spam Web Spam Research
Introduction
The World Wide Web has evolved from a handful of pages to billions of pages.
In January 2008, Google reported indexing 30 billion pages and Yahoo 37 billion.
In this huge amount of data, search engines play a significant role in finding the needed information.
Search engines consist of the following basic operations:
  Web crawling
  Ranking
  Keyword extraction
  Query processing
General Architecture of a Web Search Engine
[Figure: architecture diagram with the components Web, Crawler, Indexing, Index, Ranking, Query Operations, Visual Interface, User, and Query.]
CRAWLING MODULE
Web Crawler
Definition: A program that collects Web pages by recursively fetching links (i.e., URLs) starting from a set of seed pages [HN99].
Objective: Acquisition of large collections of Web pages to be indexed by the search engine for efficient execution of user queries.
Basic Crawler Operation
Place known seed URLs in the URL queue.
Repeat the following steps until a threshold number of pages has been downloaded:
  Fetch a URL from the URL queue and download the corresponding Web page.
  For each downloaded Web page:
    Extract URLs from the Web page.
    For each extracted URL, check the validity and availability of the URL using checking modules.
    Place the URLs that pass the checks on the URL queue.
[Figure: crawler data flow among the URL queue, Web page downloader, DNS resolver, link extractor, and checking module (URL duplication check, robots check). Labels: Seed URLs, New URLs, URLs to crawl, Web pages, Crawled Web pages, Extracted URLs, Web.]
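The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the crawler from the lecture: the in-memory `TOY_WEB` dictionary, the `download` and `passes_checks` helpers, and the example URLs are all hypothetical stand-ins, and the checks cover only URL validity and duplication (no DNS resolution or robots check).

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> URLs linked from that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/", "ftp://not-http.example/"],
    "http://d.example/": [],
}


def download(url):
    """Stand-in for the Web page downloader: returns the page's links, or None."""
    return TOY_WEB.get(url)


def passes_checks(url, seen):
    """Stand-in for the checking modules (validity and duplication only)."""
    return url.startswith("http://") and url not in seen


def crawl(seed_urls, max_pages):
    queue = deque(seed_urls)      # the URL queue
    seen = set(seed_urls)         # every URL ever queued (duplication check)
    crawled = []                  # downloaded pages, in download order
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        links = download(url)
        if links is None:         # page unavailable
            continue
        crawled.append(url)
        for link in links:        # output of the link extractor
            if passes_checks(link, seen):
                seen.add(link)
                queue.append(link)
    return crawled


print(crawl(["http://a.example/"], max_pages=10))
```

Using a FIFO queue gives a breadth-first crawl; a real crawler would replace it with a prioritized frontier and add politeness delays, which relate to the load issues listed on the next slide.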
Crawling Issues Load at visited Web sites Load at crawler Scope of crawl Incremental crawling
RANKING MODULE
Problems of TFIDF Vector
Works well on a small controlled corpus, but not on the Web
  Top result for the “American Airlines” query: an accident report about American Airlines flights
  Do users really care how many times American Airlines is mentioned?
Easy to spam
  Ranking is based purely on page content
  Authors can manipulate page content to get a high ranking
Any idea?
Web Page Ranking
Motivation
  User queries return a huge number of relevant web pages, but users want to browse the most important ones.
  Note: Relevance means that a web page matches the user’s query.
Concept
  Ordering the relevant web pages according to their importance.
  Note: Importance represents the user's interest in the relevant web pages.
Methods
  Link-based method: exploiting the link structure of the Web for ordering the search results.
  Content-based method: exploiting the contents of web pages for ordering the search results.
Link Structure of Web
Concept
  The Web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes. This link structure is called the web graph.
  Note: A web node represents either a web page or a web domain.
  Links are classified into two classes as follows:
    Inlink: the incoming link to a web node.
    Outlink: the outgoing link from a web node.
Example
  V = {A, B, C}, E = {AB, BC}
  AB is an outlink of the web node A. BC is an outlink of the web node B.
  AB is an inlink of the web node B. BC is an inlink of the web node C.
  Fig. 1: An example of a web graph (A -> B -> C).
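As a small illustration, the example graph can be stored as an edge list from which the inlinks and outlinks of each node are derived. The function name `build_link_maps` is an assumption made for this sketch.

```python
from collections import defaultdict


def build_link_maps(edges):
    """Return (outlinks, inlinks): node -> list of directed edges (src, dst)."""
    outlinks = defaultdict(list)
    inlinks = defaultdict(list)
    for src, dst in edges:
        outlinks[src].append((src, dst))   # the edge leaves src
        inlinks[dst].append((src, dst))    # the edge enters dst
    return outlinks, inlinks


# E = {AB, BC} from the example
outlinks, inlinks = build_link_maps([("A", "B"), ("B", "C")])
print("outlinks of A:", outlinks["A"])
print("inlinks of B:", inlinks["B"])
```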
PageRank: Basic Idea
Think of ...
  People as pages
  Recommendations as links
Therefore, “Pages are popular, if popular pages link them”
“PageRank is a global ranking of all Web pages regardless of their content, based solely on their location in the Web’s graph structure” [Page et al 1998]
PageRank Overview
A web page is more important if it is pointed to by many other important web pages.
The importance of a web page (called its PageRank value) represents the probability that a user visits the web page.
Function
  PR[p]: PageRank value of web page p
  Nolink(q): number of outlinks of web page q
  d: damping factor (probability of following a link)
  v[p]: probability that a user randomly jumps to web page p (random jump value over web page p)
[Figure: user's behavior on the web graph. A user moves among web pages A to F either by following a link or by jumping to a random page (e.g., a random jump from F to B).]
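The formula on this slide is not preserved in the transcript. A standard form of the PageRank recurrence that uses exactly the symbols defined above is given below; it is a reconstruction under that assumption, and the slide's own notation may have differed.

```latex
PR[p] \;=\; d \sum_{q \,:\, q \to p} \frac{PR[q]}{Nolink(q)} \;+\; (1 - d)\, v[p]
```

Here the sum runs over all web pages q that link to p. The first term models a user following a link with probability d, and the second term models a random jump to p with probability 1 - d.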
PageRank Example
[Figure: an example web graph with nodes 1, 2, 3, and 4.]
PageRank: Problems on the Real Web
Dangling nodes
  A page with no outlinks through which to send its importance
  All importance “leaks out of” the Web
  Solution: Random surfer model
Crawler trap
  A group of one or more pages that have no links out of the group
  Accumulates all the importance of the Web
  Solution: Damping factor
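Both fixes can be seen in a short power-iteration sketch. It is a minimal illustration rather than the lecture's implementation: the four-node graph is hypothetical, the random-jump distribution v is assumed to be uniform, and the rank held by dangling nodes is assumed to be redistributed according to v (one common convention).

```python
def pagerank(outlinks, d=0.85, iterations=100):
    """Power iteration. `outlinks` maps every node to a list of link targets."""
    nodes = sorted(outlinks)
    n = len(nodes)
    v = {p: 1.0 / n for p in nodes}        # uniform random-jump distribution
    pr = dict(v)                           # start from the jump distribution
    for _ in range(iterations):
        # Rank sitting on dangling nodes would otherwise leak out of the graph.
        dangling = sum(pr[q] for q in nodes if not outlinks[q])
        # Random jump with probability 1 - d, plus the redistributed dangling rank.
        new = {p: (1 - d) * v[p] + d * dangling * v[p] for p in nodes}
        for q in nodes:
            for p in outlinks[q]:
                new[p] += d * pr[q] / len(outlinks[q])   # follow a link
        pr = new
    return pr


# Hypothetical graph: node "4" is dangling (it has no outlinks).
graph = {"1": ["2", "3"], "2": ["3"], "3": ["1", "4"], "4": []}
ranks = pagerank(graph)
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

Because d is below 1, no group of pages can hold on to all of the rank: a fraction 1 - d is reassigned through random jumps in every iteration, and the scores keep summing to 1.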
Link Analysis in Modern Web Search
PageRank-like ideas play a basic role in the ranking functions of Google, Yahoo! and Bing
Current ranking functions are far from pure PageRank
  Far more complex
  Evolve all the time
  Kept secret!
Search Engine Optimization Important game-theoretic principle: the world reacts and adapts to the rules Web page authors create their Web pages with the search engine’s ranking formula in mind
A Huge Challenge for Today’s Search Engines SEO gives birth to the nuisance of Web spam
Web Spam
Concept
  Web spam: any deliberate action taken to boost a web node’s rank without improving its real merit.
  Link spam: web spam against link-based methods, i.e., an action that changes the link structure of the Web in order to boost a web node's ranking.
Example
  The actor wants to boost the rank of the web node N3 and creates the web nodes N3 to Nx.
  The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes.
  The web nodes N3 to Nx are involved in link spam, so they are called spam nodes.
  Fig. 2: An example of link spam (nodes N1 to Nx; legend: node, link, actor).
TrustRank
Overview [GGP04]
  Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains through their outlinks.
  Trust scores are propagated through the outlinks of trusted domains.
  Domains having high trust scores (≥ threshold) at the end of propagation are declared non-spam domains.
Example
  Domains 1 and 2 are seed non-spam domains with t(1) = 1 and t(2) = 1, where t(i) is the trust score of domain i.
  Domain 3 gets trust scores from domains 1 and 2: t(3) = 5/6. Domain 4 has t(4) = 1/3.
  Edge labels in the figure: 1/2, 1/3, and 5/12.
  Fig. 3: An example for explaining TrustRank (legend: a domain being considered, a seed non-spam domain).
Observation
  Trust scores can propagate to spam domains if a trusted domain has outlinks to spam domains.
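The numbers in Fig. 3 are consistent with a simple trust-splitting scheme in which each domain divides its score equally among its outlinks. The sketch below reproduces them under that assumption; it is not the full TrustRank algorithm of [GGP04], which is based on a biased PageRank computation with damping. The nodes "x", "y", "u", and "w" are hypothetical targets for outlinks whose endpoints cannot be recovered from the transcript.

```python
from fractions import Fraction


def propagate(outlinks, seeds, steps):
    """Split each node's score equally among its outlinks, `steps` times."""
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    seed_score = {n: Fraction(1 if n in seeds else 0) for n in nodes}
    score = dict(seed_score)
    for _ in range(steps):
        new = dict(seed_score)                 # seeds keep their initial score
        for src, targets in outlinks.items():
            for dst in targets:
                new[dst] += score[src] / len(targets)
        score = new
    return score


# Domain 1 has two outlinks (1/2 each), domain 2 has three (1/3 each),
# and domain 3 has two (5/12 each). "x", "y", "u", "w" are hypothetical.
fig3 = {1: [3, "x"], 2: [3, 4, "y"], 3: ["u", "w"]}
t = propagate(fig3, seeds={1, 2}, steps=2)
print("t(3) =", t[3], " t(4) =", t[4], " passed on by 3:", t["u"])
```

Exact fractions are used so that the result can be compared directly with the values 5/6, 1/3, and 5/12 shown in the figure.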
Anti-TrustRank
Overview [KR06]
  Anti-trusted domains (e.g., well-known spam domains) are usually pointed to by spam domains, i.e., their inlinks usually come from spam domains.
  Anti-trust scores are propagated through the inlinks of anti-trusted domains.
  Domains having high anti-trust scores (≥ threshold) at the end of propagation are declared spam domains.
Example
  Domains 1 and 2 are seed spam domains with at(1) = 1 and at(2) = 1, where at(i) is the anti-trust score of domain i.
  Domain 3 gets anti-trust scores from domains 1 and 2: at(3) = 5/6. Domain 4 has at(4) = 1/3.
  Edge labels in the figure: 1/2, 1/3, and 5/12.
  Fig. 4: An example for explaining Anti-TrustRank (legend: a domain being considered, a seed spam domain).
Observation
  Anti-trust scores can propagate to non-spam domains if a non-spam domain has outlinks to a spam domain.
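Propagating along inlinks is the same as propagating along outlinks of the reversed graph. The sketch below shows this with a simple equal-split scheme that reproduces the values in Fig. 4; it is an illustration under that assumption, not the algorithm of [KR06] in full. The nodes "x", "y", "u", and "w" are hypothetical sources of links whose endpoints are not recoverable from the transcript.

```python
from fractions import Fraction


def reverse(outlinks):
    """Return the reversed graph: node -> list of nodes that link to it."""
    reversed_links = {}
    for src, targets in outlinks.items():
        for dst in targets:
            reversed_links.setdefault(dst, []).append(src)
    return reversed_links


def anti_trust(outlinks, spam_seeds, steps):
    """Split each node's anti-trust equally among the nodes linking to it."""
    inlinks = reverse(outlinks)
    nodes = set(outlinks) | set(inlinks)
    seed_score = {n: Fraction(1 if n in spam_seeds else 0) for n in nodes}
    score = dict(seed_score)
    for _ in range(steps):
        new = dict(seed_score)                 # seeds keep their initial score
        for dst, sources in inlinks.items():
            for src in sources:
                new[src] += score[dst] / len(sources)
        score = new
    return score


# Forward links: 3 points to the spam seeds 1 and 2, and 4 points to 2.
fig4 = {"x": [1], 3: [1, 2], 4: [2], "y": [2], "u": [3], "w": [3]}
at = anti_trust(fig4, spam_seeds={1, 2}, steps=2)
print("at(3) =", at[3], " at(4) =", at[4])
```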
Spam Mass
Overview [GBG06]
  A domain is spam if it has an excessively high spam score.
  The spam score is estimated by subtracting the non-spam score from the PageRank score.
  The non-spam score is estimated as the trust score computed by TrustRank.
Example
  Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  Fig. 5: An example for explaining Spam Mass (domains 1 to 7; legend: a domain being considered, a seed non-spam domain).
Observation
  Since Spam Mass uses TrustRank, it inherits the same problem as TrustRank.
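The subtraction described above can be written directly. In this sketch the score values and the threshold are made up for illustration, and the PageRank and trust scores are assumed to be on a comparable scale.

```python
def spam_mass(pagerank, trust):
    """Spam score per domain: PageRank score minus non-spam (trust) score."""
    return {dom: pagerank[dom] - trust.get(dom, 0.0) for dom in pagerank}


def detect_spam(pagerank, trust, threshold):
    """Domains whose spam score reaches the (hypothetical) threshold."""
    mass = spam_mass(pagerank, trust)
    return {dom for dom, m in mass.items() if m >= threshold}


# Hypothetical scores: "b" has a high PageRank but almost no trust.
pagerank_scores = {"a": 0.40, "b": 0.35, "c": 0.25}
trust_scores = {"a": 0.38, "b": 0.05, "c": 0.20}
print(detect_spam(pagerank_scores, trust_scores, threshold=0.2))
```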
Link Farm Spam
Overview [WD05]
  A domain is spam if it has many bidirectional links with other domains.
  A domain is spam if it has many outlinks pointing to spam domains.
Example
  Domains 1, 3, and 4 have bidirectional links.
  Fig. 6: An example for explaining Link Farm Spam (domains 1 to 5; legend: a domain being considered).
Observation
  Link Farm Spam does not take any input seed set.
  A domain can have many bidirectional links with trusted domains as well.
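The two rules can be sketched as a two-step procedure: first mark domains with many bidirectional links, then repeatedly add domains with many outlinks to already marked domains. The graph and both thresholds below are hypothetical, and the sketch simplifies the method of [WD05].

```python
def link_farm_spam(outlinks, bidir_threshold, spam_out_threshold):
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    out = {n: set(outlinks.get(n, ())) for n in nodes}

    # Rule 1: many bidirectional links (n links to m and m links back to n).
    spam = set()
    for n in nodes:
        bidirectional = sum(1 for m in out[n] if m != n and n in out[m])
        if bidirectional >= bidir_threshold:
            spam.add(n)

    # Rule 2: many outlinks pointing to spam domains; repeat until stable.
    changed = True
    while changed:
        changed = False
        for n in nodes - spam:
            if len(out[n] & spam) >= spam_out_threshold:
                spam.add(n)
                changed = True
    return spam


# Hypothetical graph: 1, 2, and 3 link to each other; 4 links into the farm.
farm = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [1, 2], 5: [6], 6: []}
print(sorted(link_farm_spam(farm, bidir_threshold=2, spam_out_threshold=2)))
```

As the observation on the slide notes, rule 1 looks only at link patterns, so a domain that exchanges many links with trusted domains would also be marked.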
RESEARCH SECTION
Web Spam Filtering Algorithm
Overview
  Web spam filtering algorithms output the spam nodes to be filtered out [GBG06].
  To identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as input [GGP04, KR06, GBG06, WD05].
    Spam input seed set: the input seed set containing spam nodes.
    Non-spam input seed set: the input seed set containing non-spam nodes.
  The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam [GGP04, KR06, GBG06].
Observation
  The output quality of web spam filtering algorithms depends on the quality of the input seed sets.
  The output of one web spam filtering algorithm can be used as the input of another.
  The algorithms may support one another if placed in an appropriate succession.
Motivation and Goal
Motivation
  There is no well-known study that addresses the refinement of the input seed sets for web spam filtering algorithms.
  There is no well-known study on successions among web spam filtering algorithms.
Goal
  Improving the quality of web spam filtering by using seed refinement.
  Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.
Contributions
We propose modified algorithms that apply seed refinement techniques, using both spam and non-spam input seed sets, to well-known web spam filtering algorithms.
We propose a strategy that makes the best succession of the modified algorithms.
We conduct extensive experiments in order to show the quality improvement of our work.
  We compare the original (i.e., well-known) algorithms with the respective modified algorithms.
  We evaluate the best succession among our modified algorithms.
Web Spam Filtering Using Seed Refinement
Objectives
  Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives).
  Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives).
Our approaches
  We modify the spam filtering algorithms by using both spam and non-spam domains in order to decrease False Positives.
    We use non-spam domains so that their goodness does not propagate to spam domains.
    We use spam domains so that their badness does not propagate to non-spam domains.
  We make a succession of these algorithms in order to increase True Positives.
    The seed refinement algorithm is followed by the spam detection algorithm, so that the spam detection algorithm uses the refined input seed sets produced by the seed refinement algorithm.
Modified TrustRank
Modification
  Trust scores should not propagate to spam domains.
Example
  Domains 5 and 6 are involved in Web spam.
  Scores shown in the figure: t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …, where t(i) is the trust score of domain i.
  Fig. 7: An example explaining Modified TrustRank (legend: a seed spam domain, a seed non-spam domain, a domain being considered).
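One way to picture the modification is a propagation step that skips any target listed in the spam seed set. The sketch below uses a simple equal-split scheme and assumes that the share destined for a spam seed is simply dropped; the slides do not say how the authors handle that share. The nodes "x" and "y" and the links between domains 5 and 6 are hypothetical.

```python
from fractions import Fraction


def propagate_trust(outlinks, good_seeds, spam_seeds, steps):
    """Equal-split trust propagation that never assigns trust to spam seeds."""
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    seed_score = {n: Fraction(1 if n in good_seeds else 0) for n in nodes}
    score = dict(seed_score)
    for _ in range(steps):
        new = dict(seed_score)
        for src, targets in outlinks.items():
            for dst in targets:
                if dst in spam_seeds:
                    continue        # assumption: this share is dropped
                new[dst] += score[src] / len(targets)
        score = new
    return score


# Domain 3 links to domains 5 and 6; the 5 <-> 6 links are hypothetical.
fig7 = {1: [3, "x"], 2: [3, 4, "y"], 3: [5, 6], 5: [6], 6: [5]}
plain = propagate_trust(fig7, good_seeds={1, 2}, spam_seeds=set(), steps=2)
modified = propagate_trust(fig7, good_seeds={1, 2}, spam_seeds={5, 6}, steps=2)
print("without spam seeds: t(5) =", plain[5])
print("with spam seeds:    t(5) =", modified[5])
```

With an empty spam seed set, domain 5 receives 5/12 from domain 3, as in the figure; with domains 5 and 6 listed as spam seeds, their trust stays at zero.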
Modified Anti-TrustRank
Modification
  Anti-trust scores should not propagate to non-spam domains.
Example
  Domains 5, 6, and 7 are non-spam domains.
  Scores shown in the figure: at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …, where at(i) is the anti-trust score of domain i.
  Fig. 8: An example explaining Modified Anti-TrustRank (legend: a seed spam domain, a seed non-spam domain, a domain being considered).
Modified Spam Mass
Modification
  Use Modified TrustRank in place of TrustRank.
Example
  Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  Fig. 9: An example explaining Modified Spam Mass (domains 1 to 7; legend: a seed spam domain, a seed non-spam domain, a domain being considered).
Modified Link Farm Spam
Modification
  Use two types of input seed sets (i.e., spam and non-spam domains).
  A domain having many bidirectional links with only trusted domains is not detected as a spam domain.
Example
  Domains 1, 3, and 4 have bidirectional links.
  Fig. 10: An example explaining Modified Link Farm Spam (domains 1 to 8; legend: a seed non-spam domain, a domain being considered).
Modified Link Farm Spam
Overview
  We make a succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector).
  We also consider the execution order of the algorithms belonging to Seed Refiner and Spam Detector, respectively.
    Consideration of the execution order in Seed Refiner:
      Modified TrustRank followed by Modified Anti-TrustRank.
      Modified Anti-TrustRank followed by Modified TrustRank.
    Consideration of the execution order in Spam Detector:
      Modified Spam Mass followed by Modified Link Farm Spam.
      Modified Link Farm Spam followed by Modified Spam Mass.
  Fig. 11: The strategy of succession. Data flow: manually labeled spam and non-spam domains -> Seed Refiner -> refined spam and non-spam domains -> Spam Detector -> detected spam domains.
Performance Evaluation
Purpose
  Show the effect of seed refinement on the quality of web spam filtering.
  Show the effect of succession on the quality of web spam filtering.
Experiments
  We conduct two sets of experiments according to the two purposes mentioned above.
  Table 1: Summary of the experiments.
Experimental Parameters
Table 2: Parameters used in experiments.
Experimental Data
Table 3: Characteristics of the data set in terms of domains and web pages.
Table 4: Classification of the data set as Seed Set and Test Set.
Experimental Measure
Table 5: Description of the measures.
Note 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
Comparison between Original and Modified Algorithms (1/3)

 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 

CSE509 Lecture 3

  • 1. CSE509: Introduction to Web Science and Technology Lecture 3: The Structure of the Web, Link Analysis and Web Search Muhammad Atif Qureshi and Arjumand Younus Web Science Research Group Institute of Business Administration (IBA)
  • 2. Last Time… Basic Information Retrieval Approaches Bag of Words Assumption Information Retrieval Models Boolean model Vector-space model Topic/Language models July 23, 2011
  • 3. Today: Search Engine Architecture; Overview of Web Crawling; Web Link Structure; Ranking Problem; SEO and Web Spam; Web Spam Research
  • 4. Introduction: The World Wide Web has evolved from a handful of pages to billions of pages. In January 2008, Google reported indexing 30 billion pages and Yahoo 37 billion. In this huge amount of data, search engines play a significant role in finding the needed information. Search engines consist of the following basic operations: Web crawling, ranking, keyword extraction, and query processing.
  • 5. General Architecture of a Web Search Engine [Figure: architecture diagram with the components Web, Crawler, Indexing, Index, Ranking, Query Operations, Visual Interface, User, and Query.]
  • 7. Web Crawler. Definition: Program that collects Web pages by recursively fetching links (i.e., URLs) starting from a set of seed pages [HN99]. Objective: Acquisition of large collections of Web pages to be indexed by the search engine for efficient execution of user queries.
  • 8. Basic Crawler Operation: (1) Place known seed URLs in the URL queue. (2) Repeat the following steps until a threshold number of pages has been downloaded: fetch a URL from the URL queue and download the corresponding Web page; for each downloaded Web page, extract URLs from the Web page; for each extracted URL, check validity and availability of the URL using checking modules; place the URLs that pass the checks on the URL queue. [Figure: crawler data flow from seed URLs through the URL queue, Web page downloader, and link extractor to crawled Web pages, with a checking module made up of a URL duplication check, a DNS resolver, and a robots check.]
  • 9. Crawling Issues: load at visited Web sites; load at crawler; scope of crawl; incremental crawling.
  • 11. Problems of TFIDF Vector: Works well on a small controlled corpus, but not on the Web. Top result for the “American Airlines” query: accident report of American Airlines flights. Do users really care how many times American Airlines is mentioned? Easy to spam: ranking is purely based on page content, so authors can manipulate page content to get a high ranking. Any idea?
  • 12. Web Page Ranking. Motivation: User queries return a huge number of relevant web pages, but users want to browse the most important ones. Note: Relevance represents that a web page matches the user’s query. Concept: Ordering the relevant web pages according to their importance. Note: Importance represents the interest of a user in the relevant web pages. Methods: Link-based method: exploiting the link structure of the web for ordering the search results. Content-based method: exploiting the contents of web pages for ordering the search results.
  • 13.
  • 14. Outlink: the outgoing link from a web node. V = {A, B, C}, E = {AB, BC}. AB is an outlink of the web node A. BC is an outlink of the web node B. AB is an inlink of the web node B. BC is an inlink of the web node C. Fig. 1: An example of a web graph.
  • 15. PageRank: Basic Idea. Think of … people as pages, recommendations as links. Therefore, “Pages are popular, if popular pages link them”. “PageRank is a global ranking of all Web pages regardless of their content, based solely on their location in the Web’s graph structure” [Page et al 1998]
  • 16. PageRank Overview: A web page is more important if it is pointed to by many other important web pages. The importance of a web page (called its PageRank value) represents the probability that a user visits the web page. Function: [equation not captured in the transcript], with PR[p]: PageRank value of web page p; Nolink(q): number of outlinks of web page q; d: damping factor (probability of following a link); v[p]: probability that a user randomly jumps to web page p (random jump value over web page p). [Figure: User’s behavior on the web graph; a user on page F either follows a link or jumps to a random page, here from F to B.]
  • 17. PageRank Example [Figure: example web graph with nodes 1, 2, 3, and 4.]
  • 18. PageRank: Problems on the Real Web. Dangling nodes: A page with no links to send importance; all importance “leaks out of” the Web. Solution: Random surfer model. Crawler trap: A group of one or more pages that have no links out of the group; they accumulate all the importance of the Web. Solution: Damping factor.
  • 19. Link Analysis in Modern Web Search: PageRank-like ideas play a basic role in the ranking functions of Google, Yahoo! and Bing. Current ranking functions are far from pure PageRank: they are far more complex, evolve all the time, and are kept secret!
  • 20. Search Engine Optimization. Important game-theoretic principle: the world reacts and adapts to the rules. Web page authors create their Web pages with the search engine’s ranking formula in mind.
  • 21. A Huge Challenge for Today’s Search Engines: SEO gives birth to the nuisance of Web spam.
  • 22. Web Spam. Concept: Any deliberate action in order to boost a web node’s rank, without improving its real merit. Link spam: web spam against link-based methods; an action that changes the link structure of the web in order to boost a web node's ranking. Example (Fig. 2: An example of link spam): An actor who wants to boost the rank of the web node N3 creates the web nodes N3 to Nx. Web nodes N3-Nx are involved in link spam, so they are called spam nodes. The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes.
  • 23. TrustRank. Overview [GGP04]: Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains by using outlinks. Trust scores are propagated through the outlinks of trusted domains. Domains having high trust scores (≥ threshold) at the end of propagation are declared as non-spam domains. Observation: Trust scores can propagate to spam domains if a trusted domain outlinks to the spam domains. Example (Fig. 3: An example for explaining TrustRank): t(i) is the trust score of domain i. The seed non-spam domains have t(1)=1 and t(2)=1. The domain 3 gets trust scores from the domains 1 and 2, giving t(3)=5/6; t(4)=1/3. Link labels in the figure: 1/2, 1/3, and 5/12.
  • 24. Anti-TrustRank. Overview [KR06]: Anti-trusted domains (e.g., well-known spam domains) are usually pointed to by spam domains through inlinks. Anti-trust scores are propagated along the inlinks of anti-trusted domains. Domains having high anti-trust scores (≥ threshold) at the end of propagation are declared as spam domains. Observation: Anti-trust scores can propagate to non-spam domains if a non-spam domain outlinks to a spam domain. Example (Fig. 4: An example for explaining Anti-TrustRank): at(i) is the anti-trust score of domain i. The seed spam domains have at(1)=1 and at(2)=1. The domain 3 gets anti-trust scores from the domains 1 and 2, giving at(3)=5/6; at(4)=1/3. Link labels in the figure: 1/2, 1/3, and 5/12.
  • 25. Spam Mass. Overview [GBG06]: A domain is spam if it has an excessively high spam score. The spam score is estimated by subtracting the non-spam score from the PageRank score. The non-spam score is estimated as a trust score computed by TrustRank. Observation: Since Spam Mass uses TrustRank, it inherently has the same problem as TrustRank. Example (Fig. 5: An example for explaining Spam Mass): The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  • 26. Link Farm Spam. Overview [WD05]: A domain is spam if it has many bidirectional links with domains. A domain is spam if it has many outlinks pointing to spam domains. Observation: Link Farm Spam does not take any input seed set. A domain can have many bidirectional links with trusted domains as well. Example (Fig. 6: An example for explaining Link Farm Spam): The domains 1, 3, and 4 have bidirectional links.
  • 28. Web Spam Filtering Algorithm. Overview: The web spam filtering algorithms output spam nodes to be filtered out [GBG06]. In order to identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as an input [GGP04, KR06, GBG06, WD05]. Spam input seed set: the input seed set containing spam nodes. Non-spam input seed set: the input seed set containing non-spam nodes. The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam nodes [GGP04, KR06, GBG06]. Observation: The output quality of web spam filtering algorithms is dependent on that of the input seed sets. The output of one web spam filtering algorithm can be used as the input of another web spam filtering algorithm. The algorithms may support one another if placed in appropriate succession.
  • 29. Motivation and Goal. Motivation: There is no well-known study which addresses the refinement of the input seed sets for web spam filtering algorithms. There is no well-known study on successions among web spam filtering algorithms. Goal: Improving the quality of web spam filtering by using seed refinement. Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.
  • 30. Contributions: We propose modified algorithms that apply seed refinement techniques using both spam and non-spam input seed sets to well-known web spam filtering algorithms. We propose a strategy that makes the best succession of the modified algorithms. We conduct extensive experiments in order to show quality improvement for our work. We compare the original (i.e., well-known) algorithms with the respective modified algorithms. We evaluate the best succession among our modified algorithms.
  • 31. Web Spam Filtering Using Seed Refinement. Objectives: Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives). Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives). Our approaches: We modify the spam filtering algorithms by using both spam and non-spam domains in order to decrease False Positives. We use non-spam domains so that their goodness should not propagate to spam domains. We use spam domains so that their badness should not propagate to non-spam domains. We make the succession of these algorithms in order to increase True Positives. We make the succession of the seed refinement algorithm followed by the spam detection algorithm so that the spam detection algorithm uses the refined input seed sets, which are produced by the seed refinement algorithm.
  • 32. Modified TrustRank. Modification: Trust score should not propagate to spam domains. Example (Fig. 7: An example explaining Modified TrustRank): t(i) is the trust score of domain i. t(1)=1, t(2)=1, t(3)=5/6, t(4)=1/3, t(5)=5/12 + …, t(6)=5/12 + …. The domains 5 and 6 are involved in Web spam. Link labels in the figure: 1/2, 1/3, and 5/12.
  • 33. Modified Anti-TrustRank. Modification: Anti-trust score should not propagate to non-spam domains. Example (Fig. 8: An example explaining Modified Anti-TrustRank): at(i) is the anti-trust score of domain i. at(1)=1, at(2)=1, at(3)=5/6, at(4)=1/3, at(5)=5/12, at(6)=5/12 + …, at(7)=5/12 + …. The domains 5, 6 and 7 are non-spam domains. Link labels in the figure: 1/2, 1/3, and 5/12.
  • 34. Modified Spam Mass. Modification: Use Modified TrustRank in place of TrustRank. Example (Fig. 9: An example explaining Modified Spam Mass): The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  • 35. Modified Link Farm Spam. Modification: Use two types (i.e., spam and non-spam domains) of input seed sets. A domain having many bidirectional links with only trusted domains is not detected as a spam domain. Example (Fig. 10: An example explaining Modified Link Farm Spam): The domains 1, 3, and 4 have bidirectional links.
  • 36.
  • 37. Consideration of the execution order in Seed Refiner.
  • 38. Modified TrustRank followed by Modified Anti-TrustRank.
  • 39. Modified Anti-TrustRank followed by Modified TrustRank.
  • 40. Consideration of the execution order in Spam Detector.
  • 41. Modified Spam Mass followed by Modified Link Farm Spam.
  • 42. Modified Link Farm Spam followed by Modified Spam Mass. Fig. 11: The strategy of succession. [Figure: data flow from manually labeled spam and non-spam domains, through the Seed Refiner, to refined spam and non-spam domains, through the Spam Detector, to detected spam domains.]
  • 43. Performance Evaluation. Purpose: Show the effect of seed refinement on the quality of web spam filtering. Show the effect of succession on the quality of web spam filtering. Experiments: We conduct two sets of experiments according to the two purposes mentioned above. Table 1: Summary of the experiments.
  • 44. Experimental Parameters. Table 2: Parameters used in experiments.
  • 45.
  • 46. Experimental Measure. Table 5: Description of the measures. Footnote 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
  • 47.
  • 48. MTR performs either comparable to or slightly better than TR in terms of both true positives and false positives.
  • 49.
  • 50. MATR generally performs better than ATR in terms of true positives
  • 51. We find cutoffATr effective till 180% mark, indicating that after 100% detection becomes unstable in terms of false positives. For later experiments, we fix the cutoffATr at 100% to ensure high precision.
  • 52. Comparison between Original and Modified Algorithms (2/3). Experiment 3: Comparison Between SM and MSM. MSM performs slightly better than SM in terms of true positives and comparable in terms of false positives. We find relativeMass effective in the range of 0.95 to 0.99 in terms of maximizing true positives and minimizing false positives. For later experiments, we keep the range from 0.8 to 0.99 of relativeMass as effective range. Experiment 4: Comparison Between LFS and MLFS. MLFS performs better than LFS in terms of false positives while at some expense of true positives. We find limitBL and limitOL highly effective at 7 and 7 respectively in terms of minimizing many false positives. For later experiments, we keep limitBL = 7 and limitOL = 7.
  • 53. Comparison between Original and Modified Algorithms (3/3). Summary: We have found all modified algorithms providing better quality than the respective original algorithms. We found SM to be the best original web spam detection algorithm among the ATR, SM, and LFS algorithms due to high true positives and relatively fewer false positives. We also found MSM to be the best modified web spam detection algorithm among the MATR, MSM, and MLFS algorithms due to high true positives and relatively fewer false positives.
  • 54. The Best Succession for the Seed Refiner. Table 6: Comparison for the seed refiner. Table remarks: identical performance for both successions (three entries); better performance for MATR-MTR compared to MTR-MATR (one entry). Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
  • 55. The Best Succession for the Spam Detector. Comparison: We pick 0.99 of relativeMass since false positives are minimum at this value compared to other values of relativeMass while true positives are almost comparable for all values of relativeMass. We observe that MLFS fails to detect a considerable number of spam domains. We obtain the precisions 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM respectively. We obtain the recalls 0.80, 0.80, 0.33, and 0.76 for MLFS-MSM, MSM-MLFS, MLFS, and MSM respectively. MLFS-MSM and MSM-MLFS are best and identical in performance; we choose MLFS-MSM as the best spam detector without loss of generality. Fig. 12: Comparison for the spam detector.
  • 56. Comparison among the Best Succession, the Best Known Algorithm and the Best Modified Algorithm. Comparison: We pick 0.99 of relativeMass since false positives are minimum at this value compared to other values of relativeMass while true positives are almost comparable for all values of relativeMass. We observe that MATR-MTR-MLFS-MSM finds more true positives and some more false positives. We obtain the precisions 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM respectively. We obtain the recalls 0.64, 0.70, and 0.80 for SM, MSM, and MATR-MTR-MLFS-MSM respectively. Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM. Therefore, MATR-MTR-MLFS-MSM is more effective.
  • 57. Conclusions: We have improved the quality of web spam filtering by using seed refinement. We have proposed modifications in four well-known web spam filtering algorithms. We have proposed a strategy of succession of modified algorithms: Seed Refiner contains the order of execution for seed refinement algorithms; Spam Detector contains the order of execution for spam detection algorithms. We have conducted extensive experiments in order to show the effect of seed refinement on the quality of web spam filtering. We find that every modified algorithm performs better than the respective original algorithm. We find the best performance among the successions by MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM). This succession outperforms the best original algorithm, i.e., SM, by up to 1.25 times in recall and is comparable in terms of precision.
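
The crawl loop described on slide 8 (seed URLs, a URL queue, link extraction, and checking modules) can be sketched as follows. This is only an illustration under stated assumptions, not the lecture's implementation: `crawl`, `LinkExtractor`, the `fetch` callable, and the `allowed` callable (standing in for the robots check) are hypothetical names, and the sketch does no real network access or DNS resolution.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (the 'link extractor' module)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None (hypothetical hook)."""
    queue = deque(seeds)   # the URL queue, initialised with the seed URLs
    seen = set(seeds)      # backs the URL duplication check
    pages = {}             # crawled Web pages, keyed by URL
    while queue and len(pages) < max_pages:   # stop at the page threshold
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # page unavailable, skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before checking.
            link = urldefrag(urljoin(url, href)).url
            if link in seen or not allowed(link):
                continue
            seen.add(link)
            queue.append(link)
    return pages


# Usage with an in-memory "web" (made-up URLs) instead of real HTTP requests.
site = {
    "http://a.test/": '<a href="/b">b</a> <a href="/c#top">c</a>',
    "http://a.test/b": '<a href="/">home</a>',
    "http://a.test/c": "no links here",
}
print(sorted(crawl(["http://a.test/"], site.get)))
```

Passing `fetch` and `allowed` as parameters keeps the loop itself independent of how pages are downloaded and which politeness rules apply, which mirrors the separation of modules in the slide's diagram.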
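
The trust values quoted for Fig. 3 and Fig. 7 (t(3)=5/6, t(4)=1/3, 5/12 toward domains 5 and 6) are consistent with each domain splitting its score equally over its outlinks. The sketch below assumes that simple splitting rule; it is not the published TrustRank algorithm [GGP04], and the function name `propagate_trust`, the `rounds` limit, and the extra domains 7 and 8 in the example graph are hypothetical. Passing `spam_seeds` mimics the modification on slide 32, where trust is not propagated to known spam domains.

```python
from fractions import Fraction


def propagate_trust(outlinks, seeds, spam_seeds=frozenset(), rounds=2):
    """Push trust from seed domains along outlinks, splitting it equally.

    outlinks:   dict mapping a domain to the list of domains it links to
    seeds:      trusted (non-spam) seed domains, each starting with trust 1
    spam_seeds: known spam domains that must not receive any trust
    rounds:     how many hops the trust is pushed (an assumed stopping rule)
    """
    trust = {d: Fraction(1) for d in seeds}
    frontier = dict(trust)
    for _ in range(rounds):
        received = {}
        for src, score in frontier.items():
            targets = outlinks.get(src, [])
            for dst in targets:
                if dst in spam_seeds or dst in seeds:
                    continue  # blocked target: its share is simply dropped
                received[dst] = received.get(dst, 0) + score / len(targets)
        for dst, share in received.items():
            trust[dst] = trust.get(dst, 0) + share
        frontier = received
    return trust


# Assumed graph: 1 -> {3, 7}, 2 -> {3, 4, 8}, 3 -> {5, 6}.
graph = {1: [3, 7], 2: [3, 4, 8], 3: [5, 6]}
plain = propagate_trust(graph, seeds={1, 2})
modified = propagate_trust(graph, seeds={1, 2}, spam_seeds={5, 6})
print(plain[3], plain[4], plain[5])   # trust of domains 3, 4, 5
print(5 in modified)                  # spam seed 5 receives no trust
```

Running the same function on the reversed graph, with spam domains as seeds and non-spam seeds as the blocked set, gives the corresponding sketch for Anti-TrustRank and its modified variant.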
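
The precision and recall figures on slides 55 to 57 can be checked with a few lines of arithmetic. The sketch assumes the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), since the table describing the measures is not captured in the transcript. The counts passed to the functions are made-up numbers for illustration; only the recall values 0.64 and 0.80 come from the slides.

```python
def precision(tp, fp):
    """Fraction of detected spam domains that really are spam."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of real spam domains that were detected."""
    return tp / (tp + fn)


# Hypothetical counts, chosen only to show the calculation.
print(round(precision(tp=80, fp=13), 2))
print(round(recall(tp=80, fn=20), 2))

# Reported recalls: 0.64 for SM and 0.80 for MATR-MTR-MLFS-MSM.
improvement = round(0.80 / 0.64, 2)
print(improvement)  # ratio behind the "up to 1.25 times in recall" claim
```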

Editor's Notes

  1. Existing work classifies ranking algorithms into two classes as follows. Content-based method: exploiting the vertex information (that is, contents of web pages). Link-based method: exploiting the edge information (that is, the link structure of the web).
  2. Now, I explain the original PageRank. The main idea of PageRank is that a web page is more important if it is pointed to by many other important web pages. The importance of a web page (called its PageRank value) represents the probability that a user visits the web page. I will show how a user visits a web page by using this figure. (In this figure, circles represent web pages and arrows represent directed links.) Assume a user is on the web page F. Then, the user can visit the web page C by following the outlink FC of F, and visit other web pages by following the outlinks of C, and so on. If he gets bored with clicking on the outlinks to visit another web page, the user can also type the address of a random web page and jump to it. Here, the user on F randomly jumps to web page B. Obviously, as there are many links to web page C, the user may frequently visit C. Thus, C may be an important web page. Since the user has two ways to visit a web page, the probability that the user visits a web page, that is, the PageRank value of that web page, consists of two parts. The first part is the probability that the user visits web page p by following the outlinks of web pages that have links to p. The second part is the probability that the user visits web page p by randomly jumping from any web page.
  3. Highly successful, all major businesses use an RDB system
  4. Spam domain: a domain contributing to web spam. Non-spam domain: the universe of domains minus the spam domains.
  5. Trusted domains are a subset of the non-spam domains and are already known to humans as well.
  6. Anti-trusted domains are a subset of the spam domains, e.g., some sex websites.
  7. The rank coming from non-spam domains is estimated by TrustRank, and the rank coming from spam domains is estimated by subtracting the TrustRank value from the PageRank value. Spam Mass is a spam detection algorithm.
  8. Detect spam domains by expanding the seed set of spam domains by counting outlinks to the spam domains
  9. Here are the contents of my presentation. First, I introduce the background and motivation of my research. Then, I present related work that contains two approaches for improving the ranking quality. After that, I present the algorithms that combine these two approaches. In the main part, I present the performance evaluation. Finally, I present the conclusions.
  10. Detect spam domains on the basis of many bidirectional links. Detect more spam domains by counting outlinks to the spam domains.
  11. Detect spam domains on the basis of many bidirectional links. Detect more spam domains by counting outlinks to the spam domains.
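
The two-part description of PageRank in note 2, together with the symbols defined on slide 16, matches the usual form PR[p] = d · Σ PR[q] / Nolink(q) + (1 − d) · v[p], where the sum runs over the pages q that link to p. The formula itself is not captured in the transcript, so this form is an assumption. The sketch below iterates it on the graph of Fig. 1 (A links to B, B links to C); the function name `pagerank`, the default d = 0.85, the fixed iteration count, and the choice to spread the score of dangling pages according to v are all assumptions for illustration.

```python
def pagerank(outlinks, d=0.85, v=None, iters=100):
    """Power iteration for PR[p] = d * sum(PR[q] / Nolink(q)) + (1 - d) * v[p]."""
    nodes = sorted(set(outlinks) | {p for ts in outlinks.values() for p in ts})
    if v is None:
        v = {p: 1.0 / len(nodes) for p in nodes}   # uniform random jump
    pr = dict(v)
    for _ in range(iters):
        # Score held by dangling pages (no outlinks) would otherwise leak out;
        # here it is redistributed according to the jump distribution v.
        dangling = sum(pr[q] for q in nodes if not outlinks.get(q))
        new = {p: (1 - d) * v[p] + d * dangling * v[p] for p in nodes}
        for q in nodes:
            targets = outlinks.get(q, [])
            for p in targets:
                new[p] += d * pr[q] / len(targets)   # follow one of q's outlinks
        pr = new
    return pr


# Web graph of Fig. 1: V = {A, B, C}, E = {AB, BC}; C is a dangling node.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": []})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On this small chain the page at the end of the chain accumulates the most score, which illustrates the idea that a page is important when important pages point to it.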
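
Note 7 and slide 25 describe Spam Mass as the PageRank score minus the non-spam (trust) score, and the experiments tune a parameter called relativeMass. A minimal sketch of that thresholding step follows, assuming relativeMass is the spam score divided by the PageRank score and that both input scores are on the same scale. The function name `spam_mass`, the example domain names, and the example scores are hypothetical.

```python
def spam_mass(pagerank_scores, trust_scores, relative_mass=0.99):
    """Return domains whose relative spam mass reaches the threshold."""
    flagged = {}
    for domain, pr in pagerank_scores.items():
        if pr <= 0:
            continue   # no rank at all, so nothing to attribute to spam
        absolute = pr - trust_scores.get(domain, 0.0)   # spam score
        relative = absolute / pr                        # share of rank not explained by trust
        if relative >= relative_mass:
            flagged[domain] = relative
    return flagged


# Made-up scores: one domain is mostly backed by trust, the other is not.
pr_scores = {"good.example": 0.4, "farm.example": 0.5}
tr_scores = {"good.example": 0.3, "farm.example": 0.001}
print(sorted(spam_mass(pr_scores, tr_scores)))
```

Swapping the trust scores for ones computed with the modified TrustRank corresponds to the Modified Spam Mass variant on slide 34.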