SlideShare a Scribd company logo
TITLE :
PAGE RANKING USING MACHINE LEARNING PAGES 468 - 470
AUTHOR :
S. THARABAI M.TECH - CSE
AM 527249 - 7
CONFERENCE:
IDEA 2015
INDIAN DISTANCE EDUCATION ASSOCIATION
VENUE :
XX IDEAANNUAL CONFERENCE
on
Empowering India through Open and Distance Learning::
Breaking down Barriers, Building Partnership and
Delivering Opportunities
CONFERENCE DATE :
April 23 - 25, 2015
HOSTED BY :
TAMIL NADU OPEN UNIVERSITY
Saidapet, Chennai - 600015
Tamil Nadu, India
ABSTRACT
There are two giant umbrella categories within Machine Learning: supervised and unsu-
pervised learning. Briefly, unsupervised learning is like data mining — look through some data
and see what you can pull out of it. Typically, you don’t have much additional information before
you start the process. We’ll be looking at an unsupervised learning algorithm called k-means.
Static ranking and Machine learning algorithms are key features.
Supervised Learning
Supervised learning, starts with “training data”. Supervised learning is what we do as
children as we learn about the word- ‘apple’. Through this process you come to understand what
an apple is. Not every apple is exactly the same. Had you only ever seen one apple in your life,
you might assume that every apple is identical.
This is called “generalization” and is a very important concept in supervised learning algorithms.
When we build certain types of ML algorithms we therefore need to be aware of this idea of gen-
eralization. Our algorithms should be able to generalize but not over-generalize (easier said than
done!). Many supervised learning problems are “classification” problems. kNN is one type of
many different classification algorithms.
This is a great time to introduce another important aspect of ML: Its features. Features are what
you break an object down into as you’re processing it for ML. If you’re looking to determine
whether an object is an apple or an orange, for instance, you may want to look at the following
features: shape, size, color, waxiness, surface texture, taste etc. It also turns out that sometimes
an individual feature ends up not being helpful. Apples and oranges are roughly the same size,
so that feature can probably be ignored and save you some computational overhead. In this case,
size doesn’t really add new information to the mix. Knowing what features to look for is an im-
portant skill when designing for ML algorithms.
KEYWORDS
Supervised Learning, K-Nearest Neighbour, Static ranking, search engines, PageRank.
1. INTRODUCTION
k-nearest-neighbor and “supervised learning”, can solve some exciting problems.
Strength lie in mathematical or theoretical sophistication, but rather in the fact that they can ele-
gantly handle lots of different input parameters (“dimensions”). Since the publication of Brin and
Page’s paper on PageRank, many in the Web community have depended on PageRank for the
static (query-independent) ordering of Web pages. We show that we can significantly outperform
PageRank using features that are independent of the link structure of the Web. We gain a further
boost in accuracy by using data on the frequency at which users visit Web pages. We use Rank-
Net, a ranking machine learning algorithm, to combine these and other static features based on
anchor text and domain characteristics. The resulting model achieves a static ranking pairwise
accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
1.1 Machine Learning for Static Ranking
Over the past decade, the Web has grown exponentially in size. Unfortunately, this
growth has not been isolated to good-quality pages. The number of incorrect, spamming, and ma-
licious (e.g., phishing) sites has also grown rapidly. The sheer number of both good and bad
pages on the Web has led to an increasing reliance on search engines for the discovery of useful
information. Users rely on search engines not only to return pages related to their search query,
but also to separate the good from the bad, and order results so that the best pages are suggested
first. To date, most work on Web page ranking has focused on improving the ordering of the re-
sults returned to the user (query- dependent ranking, or dynamic ranking). However, having a
good query- independent ranking (static ranking) is also crucially important for a search engine.
1.2 A good static ranking algorithm provides numerous benefits:
• Relevance: The static rank of a page provides a general indicator to the overall quality of the
page.This is a useful input to the dynamic ranking algorithm.
• Efficiency: Typically, the search engine’s index is ordered by static rank. By traversing the in-
dex from high- quality to low-quality pages, the dynamic ranker may abort the search when it
determines that no later page will have as high of a dynamic rank as those already found. The
more accurate the static rank, the better this early-stopping ability, and hence the quicker the
search engine may respond to queries.
• Crawl Priority: The Web grows and changes as quickly as search engines can crawl it. Search
engines need a way to prioritize their crawl—to determine which pages to re- crawl, how fre-
quently, and how often to seek out new pages. Among other factors, the static rank of a page is
used to determine this prioritization. A better static rank thus provides the engine with a higher
quality, more up- to-date index.
Google is often regarded as the first commercially successful search engine. Their ranking was
originally based on the PageRank algorithm. Due to this (and possibly due to Google’s promo-
tion of PageRank to the public), PageRank is widely regarded as the best method for the
static ranking of Web pages.
Though PageRank has historically been thought to perform quite well, there has yet been little
academic evidence to support this claim. A single measure like PageRank can be easier to ma-
nipulate because spammers need only concentrate on one goal: how to cause more pages to point
to their page.
A machine learning approach to static ranking is also able to take advantage of any advances in
the machine learning field. For example, recent work on adversarial classification suggests that it
may be possible to explicitly model the Web page spammer’s (the adversary) actions, adjusting
the ranking model in advance of the spammer’s attempts to circumvent it. Another example is the
elimination of outliers in constructing the model, which helps reduce the effect that unique sites
may have on the overall quality of the static rank.
2. PAGERANK
The basic idea behind PageRank is simple: a link from a Web page to another can be seen as an
endorsement of that page. In general, links are made by people. As such, they are indicative of
the quality of the pages to which they point – when creating a page, an author presumably choos-
es to link to pages deemed to be of good quality. We can take advantage of this linkage informa-
tion to order Web pages according to their perceived quality.
Imagine a Web surfer who jumps from Web page to Web page, choosing with uniform probabili-
ty which link to follow at each step. In order to reduce the effect of dead-ends or endless cycles
the surfer will occasionally jump to a random page with some small probability α, or when on a
page with no out-links. If averaged over a sufficient number of steps, the probability the surfer is
on page j at some point in time is given by the formula:
P( j) = (1 − α ) / N + α Σ in B j P(i) / |Fi|
Where Fi is the set of pages that page i links to, and Bj is the set of pages that link to page j. The
PageRank score for node j is defined as this probability: PR(j)=P(j).
3. RANKNET
Much work in machine learning has been done on the problems of classification and regression.
Let X={xi} be a collection of feature vectors (typically, a feature is any real valued number), and
Y={yi} be a collection of associated classes, where yi is the class of the object described by fea-
ture vector xi.
3.1 PageRank
We computed PageRank on a Web graph of 5 billion crawled pages (and 20 billion known URLs
linked to by these pages). This represents a significant portion of the Web, and is approximately
the same number of pages as are used by Google, Yahoo, and MSN for their search engines. An-
other source, internal to search engines, are records of which results their users clicked on. Such
data was used by the search engine “Direct Hit”, and has recently been explored for dynamic
ranking purposes. An advantage of the toolbar data over this is that it contains information about
URL visits that are not just the result of a search.
The raw popularity is processed into a number of features such as the number of times a page
was viewed and the number of times any page in the domain was viewed.
3.2 Anchor text and inlinks
These features are based on the information associated with links to the page in question. It in-
cludes features such as the total amount of text in links pointing to the page (“anchor text”), the
number of unique words in that text, etc.
3.3 Page
This category consists of features which may be determined by looking at the page (and its URL)
alone. We used only eight, simple features such as the number of words in the body, the fre-
quency of the most common term, etc.
3.4 Domain
This category contains features that are computed as averages across all pages in the domain. For
example, the average number of outlinks on any page and the average PageRank. Many of these
features have been used by others for ranking Web pages, particularly the anchor and page fea-
tures. As mentioned, the evaluation is typically for dynamic ranking, and we wish to evaluate the
use of them for static ranking. Also, to our knowledge, this is the first study on the use of actual
pagevisitation popularity for static ranking.
Because we use a wide variety of features to come up with a static ranking, we refer to this as
fRank (for feature-based ranking). fRank uses RankNet and the set of features described in this
section to learn a ranking function for Web pages. Unless otherwise specified, fRank was trained
with all of the features.
4. RANKING METHODS IN MACHINE LEARNING
4.1 Recommendation Systems
4.2 Information Retrieval
!
4.3 Drug Discovery Problem :




4.4. Bioinformatics
Human genetics is now at a critical juncture. The molecular methods used suc-
cessfully to identify the genes underlying rare mendelian syndromes are failing to find
the numerous genes causing more common, familial, non- mendelian diseases . With the
human genome sequence nearing completion, new opportunities are being presented for
unravelling the complex genetic basis of nonmendelian disorders based on large-scale
genomewide studies .
$
$
5. Types of Ranking Problems
➢ Instance Ranking
➢ Label Ranking
➢ Subset Ranking
➢ Rank Aggregation
5.1 Instance Ranking
! ! ! ! !
$ $ $ $ $
5.2 Label Ranking
! !
! !
5.3 Subset Ranking
$ $ $ $ $ $ $ …
! ! ! ! ! ! ! …
5.4 Rank Aggregation
! ! … $
$ ! … !
. . .
Query 2
Query 1
Query 1
Query 2
6. Theory & Algorithms
1. Bipartite Ranking
2. k-partite Ranking
3. Ranking with Real-Valued Labels
4. General Instance Ranking
6.1 Applications
➢ Applications to Bioinformatics
➢ Applications to Drug Discovery
➢ Subset Ranking and Applications to Information Retrieval
1. Bipartite Ranking
! $ $ $ . . .
! ! ! ! ! ...
2. k-partite Ranking
$ $ $ $ . . .
3. Ranking with Real-Valued Labels
$ $
$ $
$ $
. . .
4. General Instance Ranking
! ! ! ! !
! ! ! ! !
6.2 Page Ranking Reports
• Domain Age Report
• Title tag Report
• Keyword Density Report
• Keyword Meta tag Report
• Description Meta tag Report
• Body Text Report
• In page Link Report
• Link Popularity Report
• Outbound Link Report
• IMG ALT Attribute Report
• Server Speed
• Page Rank Report
7. CONCLUSIONS
A good static ranking is an important component for today’s search engines and information re-
trieval systems. We have demonstrated that PageRank does not provide a very good static rank-
ing; there are many simple features that individually out perform PageRank. By combining many
static features, fRank achieves a ranking that has a significantly higher pairwise accuracy than
PageRank alone. A qualitative evaluation of the top documents shows that fRank is less technol-
ogy-biased than PageRank; by using popularity data, it is biased toward pages that Web
users, rather than Web authors, visit.
The machine learning component of fRank gives it the additional benefit of being more robust
against spammers, and allows it to leverage further developments in the machine learning com-
munity in areas such as adversarial classification. We have only begun to explore the options,
and believe that significant strides can be made in the area of static ranking by further experi-
mentation with additional features, other machine learning techniques, and additional sources of
data.
8. ACKNOWLEDGMENTS
Thank you to Marc Najork for providing us with additional PageRank computations and to Timo
Burkard for assistance with the popularity data. Many thanks to Chris Burges for providing code
and significant support in using training RankNets. Also, we thank Susan Dumais and Nick
Craswell for their edits and suggestions.
REFERENCES:
1.http://drona.csa.iisc.ernet.in/~shivani/Events/SDM-10-Tutorial/sdm10-tutorial.pdf
2.www.burakkanber.com/blog
3. http://www.stat.uchicago.edu/~lekheng/meetings/mathofranking/slides/agarwal.pdf

More Related Content

What's hot

Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
Ankit Raj
 
Page Rank
Page RankPage Rank
Page Rank
Pramit Kumar
 
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For YouConductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
Conductor
 
Page rank and hyperlink
Page rank and hyperlink Page rank and hyperlink
Page rank and hyperlink
Silicon
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A Comparison
IRJET Journal
 
Semantic Search Engine using Ontologies
Semantic Search Engine using OntologiesSemantic Search Engine using Ontologies
Semantic Search Engine using Ontologies
IJRES Journal
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
Subhajit Sahu
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Mining
sathish sak
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学
Xu jiakon
 
Determining Relevance Rankings with Search Click Logs
Determining Relevance Rankings with Search Click LogsDetermining Relevance Rankings with Search Click Logs
Determining Relevance Rankings with Search Click Logs
Inderjeet Singh
 
Knowledge Panels, Rich Snippets and Semantic Markup
Knowledge Panels, Rich Snippets and Semantic MarkupKnowledge Panels, Rich Snippets and Semantic Markup
Knowledge Panels, Rich Snippets and Semantic Markup
Bill Slawski
 
Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1
Ijcem Journal
 
Mazhiming
MazhimingMazhiming
Mazhiming
lilyyyyyyy
 
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Editor IJCATR
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020
Editor IJARCET
 
Pr
PrPr

What's hot (17)

Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
 
Page Rank
Page RankPage Rank
Page Rank
 
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For YouConductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
Conductor C3 2019 - A Sound Advantage: How Voice Search Works & Works For You
 
Page rank and hyperlink
Page rank and hyperlink Page rank and hyperlink
Page rank and hyperlink
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A Comparison
 
Semantic Search Engine using Ontologies
Semantic Search Engine using OntologiesSemantic Search Engine using Ontologies
Semantic Search Engine using Ontologies
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Mining
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学
 
Determining Relevance Rankings with Search Click Logs
Determining Relevance Rankings with Search Click LogsDetermining Relevance Rankings with Search Click Logs
Determining Relevance Rankings with Search Click Logs
 
Knowledge Panels, Rich Snippets and Semantic Markup
Knowledge Panels, Rich Snippets and Semantic MarkupKnowledge Panels, Rich Snippets and Semantic Markup
Knowledge Panels, Rich Snippets and Semantic Markup
 
Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1
 
Mazhiming
MazhimingMazhiming
Mazhiming
 
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020
 
Pr
PrPr
Pr
 

Similar to Macran

PageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportPageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey report
IOSR Journals
 
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVALCONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
ijcsa
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
inventionjournals
 
Comparative study of different ranking algorithms adopted by search engine
Comparative study of  different ranking algorithms adopted by search engineComparative study of  different ranking algorithms adopted by search engine
Comparative study of different ranking algorithms adopted by search engine
Echelon Institute of Technology
 
beginners-guide.pdf
beginners-guide.pdfbeginners-guide.pdf
beginners-guide.pdf
Creationlabz
 
Web Page Ranking using Machine Learning
Web Page Ranking using Machine LearningWeb Page Ranking using Machine Learning
Web Page Ranking using Machine Learning
Pradip Rahul
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
anchalsinghdm
 
TrustRank.PDF
TrustRank.PDFTrustRank.PDF
TrustRank.PDF
ssuser7a8460
 
Search Engine Optimization.pptx
Search Engine Optimization.pptxSearch Engine Optimization.pptx
Search Engine Optimization.pptx
NasreenHameed2
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
inventionjournals
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
Thanh Nguyen
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)
James Arnold
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spam
James Arnold
 
how to learn Search Engine Optimization for free
how to learn Search Engine Optimization for freehow to learn Search Engine Optimization for free
how to learn Search Engine Optimization for free
SadiaSagheer1
 
K1803057782
K1803057782K1803057782
K1803057782
IOSR Journals
 
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
Zac Darcy
 
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IJwest
 
Identifying Important Features of Users to Improve Page Ranking Algorithms
Identifying Important Features of Users to Improve Page Ranking Algorithms Identifying Important Features of Users to Improve Page Ranking Algorithms
Identifying Important Features of Users to Improve Page Ranking Algorithms
dannyijwest
 
page ranking web crawling
page ranking web crawlingpage ranking web crawling
page ranking web crawling
pradiprahul
 
PAGE RANKING
PAGE RANKING PAGE RANKING
PAGE RANKING
pradiprahul
 

Similar to Macran (20)

PageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportPageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey report
 
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVALCONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
 
Comparative study of different ranking algorithms adopted by search engine
Comparative study of  different ranking algorithms adopted by search engineComparative study of  different ranking algorithms adopted by search engine
Comparative study of different ranking algorithms adopted by search engine
 
beginners-guide.pdf
beginners-guide.pdfbeginners-guide.pdf
beginners-guide.pdf
 
Web Page Ranking using Machine Learning
Web Page Ranking using Machine LearningWeb Page Ranking using Machine Learning
Web Page Ranking using Machine Learning
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
 
TrustRank.PDF
TrustRank.PDFTrustRank.PDF
TrustRank.PDF
 
Search Engine Optimization.pptx
Search Engine Optimization.pptxSearch Engine Optimization.pptx
Search Engine Optimization.pptx
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spam
 
how to learn Search Engine Optimization for free
how to learn Search Engine Optimization for freehow to learn Search Engine Optimization for free
how to learn Search Engine Optimization for free
 
K1803057782
K1803057782K1803057782
K1803057782
 
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
 
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS
 
Identifying Important Features of Users to Improve Page Ranking Algorithms
Identifying Important Features of Users to Improve Page Ranking Algorithms Identifying Important Features of Users to Improve Page Ranking Algorithms
Identifying Important Features of Users to Improve Page Ranking Algorithms
 
page ranking web crawling
page ranking web crawlingpage ranking web crawling
page ranking web crawling
 
PAGE RANKING
PAGE RANKING PAGE RANKING
PAGE RANKING
 

Recently uploaded

NHL Stenden University of Applied Sciences Diploma Degree Transcript
NHL Stenden University of Applied Sciences Diploma Degree TranscriptNHL Stenden University of Applied Sciences Diploma Degree Transcript
NHL Stenden University of Applied Sciences Diploma Degree Transcript
lhtvqoag
 
LGBTQIA Pride Month presentation Template
LGBTQIA Pride Month presentation TemplateLGBTQIA Pride Month presentation Template
LGBTQIA Pride Month presentation Template
DakshGudwani
 
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
bo44ban1
 
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
qo1as76n
 
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdfAHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
talaatahm
 
Divertidamente SLIDE.pptxufururururuhrurid8dj
Divertidamente SLIDE.pptxufururururuhrurid8djDivertidamente SLIDE.pptxufururururuhrurid8dj
Divertidamente SLIDE.pptxufururururuhrurid8dj
lunaemel03
 
定制美国西雅图城市大学毕业证学历证书原版一模一样
定制美国西雅图城市大学毕业证学历证书原版一模一样定制美国西雅图城市大学毕业证学历证书原版一模一样
定制美国西雅图城市大学毕业证学历证书原版一模一样
qo1as76n
 
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
Designforuminternational
 
Heuristics Evaluation - How to Guide.pdf
Heuristics Evaluation - How to Guide.pdfHeuristics Evaluation - How to Guide.pdf
Heuristics Evaluation - How to Guide.pdf
Jaime Brown
 
一比一原版布兰登大学毕业证(BU毕业证书)如何办理
一比一原版布兰登大学毕业证(BU毕业证书)如何办理一比一原版布兰登大学毕业证(BU毕业证书)如何办理
一比一原版布兰登大学毕业证(BU毕业证书)如何办理
wkip62b
 
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
ijk38lw
 
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
k7nm6tk
 
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
hw2xf1m
 
UXpert_Report (UALR Mapping Renewal 2022).pdf
UXpert_Report (UALR Mapping Renewal 2022).pdfUXpert_Report (UALR Mapping Renewal 2022).pdf
UXpert_Report (UALR Mapping Renewal 2022).pdf
anthonylin333
 
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
21uul8se
 
International Upcycling Research Network advisory board meeting 4
International Upcycling Research Network advisory board meeting 4International Upcycling Research Network advisory board meeting 4
International Upcycling Research Network advisory board meeting 4
Kyungeun Sung
 
一比一原版马来西亚世纪大学毕业证成绩单一模一样
一比一原版马来西亚世纪大学毕业证成绩单一模一样一比一原版马来西亚世纪大学毕业证成绩单一模一样
一比一原版马来西亚世纪大学毕业证成绩单一模一样
k4krdgxx
 
Getting Data Ready for Culture Hack by Neontribe
Getting Data Ready for Culture Hack by NeontribeGetting Data Ready for Culture Hack by Neontribe
Getting Data Ready for Culture Hack by Neontribe
Harry Harrold
 
Introduction to User experience design for beginner
Introduction to User experience design for beginnerIntroduction to User experience design for beginner
Introduction to User experience design for beginner
ellemjani
 
CocaCola_Brand_equity_package_2012__.pdf
CocaCola_Brand_equity_package_2012__.pdfCocaCola_Brand_equity_package_2012__.pdf
CocaCola_Brand_equity_package_2012__.pdf
PabloMartelLpez
 

Recently uploaded (20)

NHL Stenden University of Applied Sciences Diploma Degree Transcript
NHL Stenden University of Applied Sciences Diploma Degree TranscriptNHL Stenden University of Applied Sciences Diploma Degree Transcript
NHL Stenden University of Applied Sciences Diploma Degree Transcript
 
LGBTQIA Pride Month presentation Template
LGBTQIA Pride Month presentation TemplateLGBTQIA Pride Month presentation Template
LGBTQIA Pride Month presentation Template
 
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
一比一原版阿肯色大学毕业证(UCSF毕业证书)如何办理
 
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
哪里办理美国中央华盛顿大学毕业证双学位证书原版一模一样
 
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdfAHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
AHMED TALAAT ARCHITECTURE PORTFOLIO .pdf
 
Divertidamente SLIDE.pptxufururururuhrurid8dj
Divertidamente SLIDE.pptxufururururuhrurid8djDivertidamente SLIDE.pptxufururururuhrurid8dj
Divertidamente SLIDE.pptxufururururuhrurid8dj
 
定制美国西雅图城市大学毕业证学历证书原版一模一样
定制美国西雅图城市大学毕业证学历证书原版一模一样定制美国西雅图城市大学毕业证学历证书原版一模一样
定制美国西雅图城市大学毕业证学历证书原版一模一样
 
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
Discovering the Best Indian Architects A Spotlight on Design Forum Internatio...
 
Heuristics Evaluation - How to Guide.pdf
Heuristics Evaluation - How to Guide.pdfHeuristics Evaluation - How to Guide.pdf
Heuristics Evaluation - How to Guide.pdf
 
一比一原版布兰登大学毕业证(BU毕业证书)如何办理
一比一原版布兰登大学毕业证(BU毕业证书)如何办理一比一原版布兰登大学毕业证(BU毕业证书)如何办理
一比一原版布兰登大学毕业证(BU毕业证书)如何办理
 
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
一比一原版(Vancouver毕业证书)温哥华岛大学毕业证如何办理
 
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
一比一原版(LSBU毕业证书)伦敦南岸大学毕业证如何办理
 
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
原版制作(MDIS毕业证书)新加坡管理发展学院毕业证学位证一模一样
 
UXpert_Report (UALR Mapping Renewal 2022).pdf
UXpert_Report (UALR Mapping Renewal 2022).pdfUXpert_Report (UALR Mapping Renewal 2022).pdf
UXpert_Report (UALR Mapping Renewal 2022).pdf
 
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
一比一原版亚利桑那大学毕业证(UA毕业证书)如何办理
 
International Upcycling Research Network advisory board meeting 4
International Upcycling Research Network advisory board meeting 4International Upcycling Research Network advisory board meeting 4
International Upcycling Research Network advisory board meeting 4
 
一比一原版马来西亚世纪大学毕业证成绩单一模一样
一比一原版马来西亚世纪大学毕业证成绩单一模一样一比一原版马来西亚世纪大学毕业证成绩单一模一样
一比一原版马来西亚世纪大学毕业证成绩单一模一样
 
Getting Data Ready for Culture Hack by Neontribe
Getting Data Ready for Culture Hack by NeontribeGetting Data Ready for Culture Hack by Neontribe
Getting Data Ready for Culture Hack by Neontribe
 
Introduction to User experience design for beginner
Introduction to User experience design for beginnerIntroduction to User experience design for beginner
Introduction to User experience design for beginner
 
CocaCola_Brand_equity_package_2012__.pdf
CocaCola_Brand_equity_package_2012__.pdfCocaCola_Brand_equity_package_2012__.pdf
CocaCola_Brand_equity_package_2012__.pdf
 

Macran

  • 1. TITLE : PAGE RANKING USING MACHINE LEARNING PAGES 468 - 470 AUTHOR : S. THARABAI M.TECH - CSE AM 527249 - 7 CONFERENCE: IDEA 2015 INDIAN DISTANCE EDUCATION ASSOCIATION VENUE : XX IDEAANNUAL CONFERENCE on Empowering India through Open and Distance Learning:: Breaking down Barriers, Building Partnership and Delivering Opportunities CONFERENCE DATE : April 23 - 25, 2015 HOSTED BY : TAMIL NADU OPEN UNIVERSITY Saidapet, Chennai - 600015 Tamil Nadu, India
  • 2. ABSTRACT There are two giant umbrella categories within Machine Learning: supervised and unsu- pervised learning. Briefly, unsupervised learning is like data mining — look through some data and see what you can pull out of it. Typically, you don’t have much additional information before you start the process. We’ll be looking at an unsupervised learning algorithm called k-means. Static ranking and Machine learning algorithms are key features. Supervised Learning Supervised learning, starts with “training data”. Supervised learning is what we do as children as we learn about the word- ‘apple’. Through this process you come to understand what an apple is. Not every apple is exactly the same. Had you only ever seen one apple in your life, you might assume that every apple is identical. This is called “generalization” and is a very important concept in supervised learning algorithms. When we build certain types of ML algorithms we therefore need to be aware of this idea of gen- eralization. Our algorithms should be able to generalize but not over-generalize (easier said than done!). Many supervised learning problems are “classification” problems. kNN is one type of many different classification algorithms. This is a great time to introduce another important aspect of ML: Its features. Features are what you break an object down into as you’re processing it for ML. If you’re looking to determine whether an object is an apple or an orange, for instance, you may want to look at the following features: shape, size, color, waxiness, surface texture, taste etc. It also turns out that sometimes an individual feature ends up not being helpful. Apples and oranges are roughly the same size, so that feature can probably be ignored and save you some computational overhead. In this case, size doesn’t really add new information to the mix. Knowing what features to look for is an im- portant skill when designing for ML algorithms. KEYWORDS Supervised Learning, K-Nearest Neighbour, Static ranking, search engines, PageRank.
  • 3. 1. INTRODUCTION k-nearest-neighbor and “supervised learning”, can solve some exciting problems. Strength lie in mathematical or theoretical sophistication, but rather in the fact that they can ele- gantly handle lots of different input parameters (“dimensions”). Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use Rank- Net, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random). 1.1 Machine Learning for Static Ranking Over the past decade, the Web has grown exponentially in size. Unfortunately, this growth has not been isolated to good-quality pages. The number of incorrect, spamming, and ma- licious (e.g., phishing) sites has also grown rapidly. The sheer number of both good and bad pages on the Web has led to an increasing reliance on search engines for the discovery of useful information. Users rely on search engines not only to return pages related to their search query, but also to separate the good from the bad, and order results so that the best pages are suggested first. To date, most work on Web page ranking has focused on improving the ordering of the re- sults returned to the user (query- dependent ranking, or dynamic ranking). However, having a good query- independent ranking (static ranking) is also crucially important for a search engine. 1.2 A good static ranking algorithm provides numerous benefits: • Relevance: The static rank of a page provides a general indicator to the overall quality of the page.This is a useful input to the dynamic ranking algorithm. • Efficiency: Typically, the search engine’s index is ordered by static rank. By traversing the in- dex from high- quality to low-quality pages, the dynamic ranker may abort the search when it determines that no later page will have as high of a dynamic rank as those already found. The more accurate the static rank, the better this early-stopping ability, and hence the quicker the search engine may respond to queries.
  • 4. • Crawl Priority: The Web grows and changes as quickly as search engines can crawl it. Search engines need a way to prioritize their crawl—to determine which pages to re- crawl, how fre- quently, and how often to seek out new pages. Among other factors, the static rank of a page is used to determine this prioritization. A better static rank thus provides the engine with a higher quality, more up- to-date index. Google is often regarded as the first commercially successful search engine. Their ranking was originally based on the PageRank algorithm. Due to this (and possibly due to Google’s promo- tion of PageRank to the public), PageRank is widely regarded as the best method for the static ranking of Web pages. Though PageRank has historically been thought to perform quite well, there has yet been little academic evidence to support this claim. A single measure like PageRank can be easier to ma- nipulate because spammers need only concentrate on one goal: how to cause more pages to point to their page. A machine learning approach to static ranking is also able to take advantage of any advances in the machine learning field. For example, recent work on adversarial classification suggests that it may be possible to explicitly model the Web page spammer’s (the adversary) actions, adjusting the ranking model in advance of the spammer’s attempts to circumvent it. Another example is the elimination of outliers in constructing the model, which helps reduce the effect that unique sites may have on the overall quality of the static rank. 2. PAGERANK The basic idea behind PageRank is simple: a link from a Web page to another can be seen as an endorsement of that page. In general, links are made by people. As such, they are indicative of the quality of the pages to which they point – when creating a page, an author presumably choos- es to link to pages deemed to be of good quality. We can take advantage of this linkage informa- tion to order Web pages according to their perceived quality. Imagine a Web surfer who jumps from Web page to Web page, choosing with uniform probabili- ty which link to follow at each step. In order to reduce the effect of dead-ends or endless cycles the surfer will occasionally jump to a random page with some small probability α, or when on a
  • 5. page with no out-links. If averaged over a sufficient number of steps, the probability the surfer is on page j at some point in time is given by the formula: P( j) = (1 − α ) / N + α Σ in B j P(i) / |Fi| Where Fi is the set of pages that page i links to, and Bj is the set of pages that link to page j. The PageRank score for node j is defined as this probability: PR(j)=P(j). 3. RANKNET Much work in machine learning has been done on the problems of classification and regression. Let X={xi} be a collection of feature vectors (typically, a feature is any real valued number), and Y={yi} be a collection of associated classes, where yi is the class of the object described by fea- ture vector xi. 3.1 PageRank We computed PageRank on a Web graph of 5 billion crawled pages (and 20 billion known URLs linked to by these pages). This represents a significant portion of the Web, and is approximately the same number of pages as are used by Google, Yahoo, and MSN for their search engines. An- other source, internal to search engines, are records of which results their users clicked on. Such data was used by the search engine “Direct Hit”, and has recently been explored for dynamic ranking purposes. An advantage of the toolbar data over this is that it contains information about URL visits that are not just the result of a search. The raw popularity is processed into a number of features such as the number of times a page was viewed and the number of times any page in the domain was viewed. 3.2 Anchor text and inlinks These features are based on the information associated with links to the page in question. It in- cludes features such as the total amount of text in links pointing to the page (“anchor text”), the number of unique words in that text, etc. 3.3 Page This category consists of features which may be determined by looking at the page (and its URL) alone. We used only eight, simple features such as the number of words in the body, the fre- quency of the most common term, etc.
  • 6. 3.4 Domain This category contains features that are computed as averages across all pages in the domain. For example, the average number of outlinks on any page and the average PageRank. Many of these features have been used by others for ranking Web pages, particularly the anchor and page fea- tures. As mentioned, the evaluation is typically for dynamic ranking, and we wish to evaluate the use of them for static ranking. Also, to our knowledge, this is the first study on the use of actual pagevisitation popularity for static ranking. Because we use a wide variety of features to come up with a static ranking, we refer to this as fRank (for feature-based ranking). fRank uses RankNet and the set of features described in this section to learn a ranking function for Web pages. Unless otherwise specified, fRank was trained with all of the features. 4. RANKING METHODS IN MACHINE LEARNING 4.1 Recommendation Systems 4.2 Information Retrieval !
  • 7. 4.3 Drug Discovery Problem : 
 
 4.4. Bioinformatics Human genetics is now at a critical juncture. The molecular methods used suc- cessfully to identify the genes underlying rare mendelian syndromes are failing to find the numerous genes causing more common, familial, non- mendelian diseases . With the human genome sequence nearing completion, new opportunities are being presented for unravelling the complex genetic basis of nonmendelian disorders based on large-scale genomewide studies .
  • 8. $ $ 5. Types of Ranking Problems ➢ Instance Ranking ➢ Label Ranking ➢ Subset Ranking ➢ Rank Aggregation
  • 9. 5.1 Instance Ranking ! ! ! ! ! $ $ $ $ $ 5.2 Label Ranking ! ! ! !
  • 10. 5.3 Subset Ranking $ $ $ $ $ $ $ … ! ! ! ! ! ! ! … 5.4 Rank Aggregation ! ! … $ $ ! … ! . . . Query 2 Query 1 Query 1 Query 2
  • 11. 6. Theory & Algorithms 1. Bipartite Ranking 2. k-partite Ranking 3. Ranking with Real-Valued Labels 4. General Instance Ranking 6.1 Applications ➢ Applications to Bioinformatics ➢ Applications to Drug Discovery ➢ Subset Ranking and Applications to Information Retrieval 1. Bipartite Ranking ! $ $ $ . . . ! ! ! ! ! ... 2. k-partite Ranking $ $ $ $ . . .
  • 12. 3. Ranking with Real-Valued Labels $ $ $ $ $ $ . . . 4. General Instance Ranking ! ! ! ! ! ! ! ! ! !
  • 13. 6.2 Page Ranking Reports • Domain Age Report • Title tag Report • Keyword Density Report • Keyword Meta tag Report • Description Meta tag Report • Body Text Report • In page Link Report • Link Popularity Report • Outbound Link Report • IMG ALT Attribute Report • Server Speed • Page Rank Report 7. CONCLUSIONS A good static ranking is an important component for today’s search engines and information re- trieval systems. We have demonstrated that PageRank does not provide a very good static rank- ing; there are many simple features that individually out perform PageRank. By combining many static features, fRank achieves a ranking that has a significantly higher pairwise accuracy than PageRank alone. A qualitative evaluation of the top documents shows that fRank is less technol- ogy-biased than PageRank; by using popularity data, it is biased toward pages that Web users, rather than Web authors, visit. The machine learning component of fRank gives it the additional benefit of being more robust against spammers, and allows it to leverage further developments in the machine learning com- munity in areas such as adversarial classification. We have only begun to explore the options, and believe that significant strides can be made in the area of static ranking by further experi- mentation with additional features, other machine learning techniques, and additional sources of data.
  • 14. 8. ACKNOWLEDGMENTS Thank you to Marc Najork for providing us with additional PageRank computations and to Timo Burkard for assistance with the popularity data. Many thanks to Chris Burges for providing code and significant support in using training RankNets. Also, we thank Susan Dumais and Nick Craswell for their edits and suggestions. REFERENCES: 1.http://drona.csa.iisc.ernet.in/~shivani/Events/SDM-10-Tutorial/sdm10-tutorial.pdf 2.www.burakkanber.com/blog 3. http://www.stat.uchicago.edu/~lekheng/meetings/mathofranking/slides/agarwal.pdf