My talk about PageRank at the Department of Electrical Engineering, National Taiwan University, in 2010. Caveat: many math symbols got scrambled, damn software backward compatibility.
2. Disclaimer (legal)
The content of this talk is the speaker's personal opinion and is not the opinion or policy of his employer.
3. Outline
● Introduction
● Google and Google search
● PageRank algorithm for ranking web pages
● Using MapReduce to calculate PageRank for billions of pages
● Impact factor of journals and PageRank
● Conclusion
4. Google
● The name: a homophone of the word "googol", which means 10^100.
● The company:
○ founded by Larry Page and Sergey Brin in 1998,
○ ~20,000 employees as of 2009,
○ spread across 68 offices around the world (23 in North America, 3 in Latin America, 14 in Asia Pacific, 23 in Europe, 5 in the Middle East and Africa).
● The mission: "to organize the world's information and make it universally accessible and useful."
5. Google Services
Photo by mr.hero on panoramio (http://www.panoramio.com/photo/1127015)
news, book search, picasaweb, product search, video, maps, Gmail, Android, blogger.com, groups, calendar, scholar, Earth, Sky, web search, desktop, translate, talk, YouTube, Chrome, reader, iGoogle
7. The abundance problem
Quoting Langville and Meyer's nice book "Google's PageRank and Beyond: The Science of Search Engine Rankings", on the men in Jorge Luis Borges' 1941 short story "The Library of Babel", which describes an imaginary, infinite library:
● When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. There was no personal or world problem whose eloquent solution did not exist in some hexagon. . . . As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible, seemed almost intolerable.
8. The power of web search engines nowadays

             Search        Typical desktop document application
Data size    O(10^14 B)    O(10^6 B)
Speed        ~250 ms       ~1 s
Relevance    high          low

(Enablers: cloud computing, ranking, web proliferation.)
9. Serving a query
[Architecture diagram from IEEE Micro, 2003: a query enters the Google Web Server, which consults misc. servers (spell checker, ad server), index servers holding index shards I0…IN, and doc servers holding doc shards D0…DM. Answering one query takes ~0.25 seconds and involves 1000+ machines.]
10. Major contributors
● Ranking the results: Larry Page, "For the creation of the Google search engine." (2004)
● Indexing the web: Sergey Brin, "For leadership in development of rapid indexing and retrieval of relevant information from the World Wide Web." (2009)
● Dealing with scale: Jeff Dean and Sanjay Ghemawat, "For contributions to the science and engineering of large-scale distributed computer systems." (2009)
11. Google Scholar
● Search for scholarly literature from one place
○ Motto: “Stand on the shoulders of giants”
● http://scholar.google.com/
12. What can be found with Google Scholar?
● Types: articles, theses, books and abstracts.
○ Added court opinions from U.S. federal and state district, appellate and supreme courts on Nov 17.
● Sources: academic publishers, professional societies, online repositories, universities and other web sites.
14. Ranking the web
● Given a query, which are the most relevant / authoritative pages?
● Which pages should be in the search result?
○ Find pages containing the query?
■ Not enough. By searching "NTU physics" you expect to see http://www.phys.ntu.edu.tw/ in results, but that page contains only one "NTU" and zero "physics" - because it's in Chinese!
○ Use links, anchor texts and other means to bring it in
● How to rank these pages?
○ Content analysis
○ Link analysis
15. The web as a graph
[Graph sketch: web pages such as www.cnn.com, www.nytimes.com and en.wikipedia.org as nodes, with hyperlinks between them as directed edges.]
16. Link analysis of the web graph
● Crawl web pages and form a graph G = (V, E)
○ V = vertices = nodes in the graph. One node is one web page with a unique URL. Let N = |V|.
○ E = edges = links in the graph. One link is one hyperlink.
● Calculate and attach scores/indicators/ranks to nodes based on the links
○ in-links/out-links: links that enter/leave a node
○ in-degree/out-degree = number of in-links/out-links
● N x N adjacency matrix A: A_ij = 1 if (i,j) ∈ E, 0 otherwise.
○ A is sparse: average out-degree ~ 10, N ~ O(10^10)
17. Diameter of the web by Albert et al. (1999)
● Power-law distributions of out-links (a) and in-links (b), with exponents γ_out = 2.45 and γ_in = 2.1: a "scale-free network".
● Mean distance vs. graph size: a "small world".
19. The challenges of web search
● Web is huge.
○ O(10^10) pages, O(10^11) links
○ the web graph most likely can't be stored in one computer's memory.
● Web is dynamic. Web pages appear and disappear every day.
● Web is self-organized. Every author is free to create/delete pages and make links to whichever page s/he likes. There is no "formula" or central control.
20. A test graph
[Figure: an 8-node directed test graph (nodes 1-8) with a marked source, sink, hub and auth node.]
source: is like a master's thesis that nobody knows about.
sink: is a paper that cites nobody.
hub: much more out-links than in-links, like a review article.
auth: much more in-links than out-links, like an "authoritative" paper.
21. Bibliometrics
● The study of written documents and their citation structure.
● A long and profound interest in the use of citations to produce quantitative estimates of the importance and "impact" of individual scientific papers and journals.
○ Scientific value of a paper is only as much as recognized by the social network in the scientific community?
● The most well-known metric: Garfield's impact factor.
22. Garfield's Impact Factor (1972)
● Used to provide a numerical assessment of journals in Journal Citation Reports of the Institute for Scientific Information.
○ used in criteria for Ph.D. completion or as a reference point in proposal review in some places.
● IF of a journal X = average number of citations per paper received this year for articles published in X in the last 2 years.
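As a sanity check on the definition, here is a minimal Python sketch of the two-year impact factor; the journal counts below are made up for illustration:

```python
# Garfield's two-year impact factor: citations received this year by
# articles published in the previous two years, divided by the number
# of articles published in those two years. Numbers are hypothetical.
def impact_factor(citations, published, year):
    """citations[y] = citations received this year by articles published in y;
    published[y] = number of articles the journal published in y."""
    cited = citations[year - 1] + citations[year - 2]
    papers = published[year - 1] + published[year - 2]
    return cited / papers

citations = {2008: 150, 2009: 250}   # citations received in 2010
published = {2008: 90, 2009: 110}
if_2010 = impact_factor(citations, published, 2010)  # (150+250)/(90+110) = 2.0
```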
24. Katz's Path-counting algorithm (1953)
● For evaluating the "status" of a person or a group by who "chooses" them.
● Number of paths from person i to person j of length k = (A^k)_ij. Note that A^k counts all walks of length k in the graph.
● Contribution from person i to person j's status = a weighted path sum Σ_k b^k (A^k)_ij. The constant b is context and group dependent.
● The "status" of person j is then the sum of contributions from all other persons: s_j = Σ_i Σ_k b^k (A^k)_ij.
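Katz's weighted path sum can be sketched in a few lines using plain matrix powers; the 3-person adjacency matrix and the value of b below are made up for illustration:

```python
# Katz status: s_j = sum over path lengths k of b^k * (A^k)_ij, summed
# over all sources i. A[i][j] = 1 means person i "chooses" person j.
def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_status(A, b, max_len=10):
    n = len(A)
    status = [0.0] * n
    Ak = A                      # A^1 counts direct choices
    for k in range(1, max_len + 1):
        for j in range(n):
            status[j] += (b ** k) * sum(Ak[i][j] for i in range(n))
        Ak = mat_mul(Ak, A)     # advance to paths one step longer
    return status

# 0 -> 1, 0 -> 2, 1 -> 2: person 2 is chosen both directly and indirectly.
A = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
s = katz_status(A, b=0.5)       # s = [0.0, 0.5, 1.25]
```

Note how the indirect path 0 → 1 → 2 contributes b^2 = 0.25 to person 2's status on top of the two direct b = 0.5 contributions.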
26. Features of Katz's Path-Counting algorithm
● More than just direct link counting
○ Indirect links are taken into account
● The constant b is the effectiveness of a link. It is accumulated along a path.
● Problems when there are cross links or cycles in the graph
○ Cross links: every non-zero path count whose path goes through a node in a cross link implies further non-zero counts at longer lengths, generally an over-estimate.
○ Not an issue with journal publications.
○ Web pages can be updated, making cycles a common phenomenon.
27. The HITS algorithm by Kleinberg (1999)
● HITS = Hyperlink-Induced Topic Search, a.k.a. the "Hubs and Authorities" algorithm.
● From a pre-selected graph of N pages, try to find hubs (out-link dominant) and authorities (in-link dominant).
○ assign page k a hub score h_k and an authority score a_k.
○ a_j is the sum of h_i over all pages i that have an (i→j) link; h_i is the sum of a_j over all pages j that have an (i→j) link. In matrix form, a = A^T h and h = A a, where A is the adjacency matrix.
○ normalized vectors h = (h_1, h_2, ..., h_N), a = (a_1, a_2, ..., a_N)
28. HITS algorithm (cont.)
● Start with a_0 = h_0 = (1, 1, ..., 1), iterate until convergence.
○ convergence to the eigenvector with the largest |λ| is guaranteed by eigenvalue theory.
● Pick the top X pages with the highest authority scores to be the best search results.
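The iteration can be sketched in a few lines of Python; the tiny 3-page graph is made up, and L1 normalization is used here for simplicity (any fixed norm works for ranking purposes):

```python
# HITS power iteration: a_j sums hub scores over in-links, h_i sums
# authority scores over out-links, renormalizing each round.
def hits(links, n, iters=50):
    """links: list of directed edges (i, j) on nodes 0..n-1."""
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a_new = [0.0] * n
        for i, j in links:
            a_new[j] += h[i]        # authority gathers hubs that link in
        h_new = [0.0] * n
        for i, j in links:
            h_new[i] += a_new[j]    # hub gathers authorities it links to
        na = sum(a_new) or 1.0
        nh = sum(h_new) or 1.0
        a = [x / na for x in a_new]
        h = [x / nh for x in h_new]
    return h, a

# Pages 0 and 1 both point to 2: page 2 is the authority, 0 and 1 are hubs.
links = [(0, 2), (1, 2)]
h, a = hits(links, 3)
```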
30. Pinski-Narin: Influence (1976)
● "Not all citations are created equal"
● Impact of journal j = influence weight of journal j = r_j
● W_ij = fraction of citations of journal i that goes to journal j
● Influence of journal j = weighted sum of influences of journals that cite journal j: r_j = Σ_i W_ij r_i
○ More citations give higher influence; citations from influential journals give higher influence.
● r is the eigenvector of matrix W^T with eigenvalue 1: W^T r = r.
○ Question: Does such a solution always exist?
31. Calculation of influence
● From the eigenvalue equation W^T r = r, we can use the Power Method to find the eigenvector of the largest eigenvalue.
● Start with r_0 = (1, 1, ..., 1)^T.
● Calculate r_k = W^T r_{k-1}.
● Repeat until convergence is reached: || r_k - r_{k-1} || < ε.
● Questions:
○ choice of r_0?
○ convergence?
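A minimal sketch of this power method in Python, on a made-up 2-journal citation matrix W (rows sum to 1):

```python
# Power method for Pinski-Narin influence weights: iterate r <- W^T r,
# renormalizing to L1 norm 1, until the vector stops changing.
def influence(W, tol=1e-12, max_iter=10000):
    n = len(W)
    r = [1.0] * n                                       # r0 = (1, 1, ..., 1)
    for _ in range(max_iter):
        r_new = [sum(W[i][j] * r[i] for i in range(n))  # (W^T r)_j
                 for j in range(n)]
        s = sum(r_new)
        r_new = [x / s for x in r_new]                  # keep L1 norm at 1
        if max(abs(x - y) for x, y in zip(r_new, r)) < tol:
            break
        r = r_new
    return r_new

W = [[0.5, 0.5],   # journal 0 cites itself and journal 1 equally
     [1.0, 0.0]]   # journal 1 cites only journal 0
r = influence(W)   # converges to (2/3, 1/3)
```

The fixed point can be checked by hand: r_0 = 0.5 r_0 + r_1 and r_1 = 0.5 r_0 give r = (2/3, 1/3).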
33. Eliminate both the source and the sink
[Figure: the 8-node test graph with the source and sink removed; the computed influence weights over the nodes are approximately 0.07, 0.02, 0.27, 0.02, 0.02, 0.14, 0.23, 0.24.]
34. Features of Pinski and Narin's Influence
● Propagation of influences
● Geller introduced a Markov chain interpretation in 1978.
○ W is a stochastic matrix.
○ Markov process: a reader randomly starts with a journal, then chooses another journal randomly from the citations.
○ r_j = the stationary probability that journal j is read. It's natural to use the L1 norm for the r vector, i.e., Σ_j r_j = 1.
○ Perron–Frobenius theorem: principal eigenvalue = 1 for stochastic matrices.
35. What's wrong?
With the sink and the source included, W is no longer a transition matrix! No solution is guaranteed if there are sinks or sources. A group of nodes linking to each other but nowhere else is also a problem (a "rank sink").
36. The PageRank algorithm
● Improve over Pinski-Narin's fragility on sinks by introducing a random surfer model.
○ A random surfer reads pages, randomly following links or jumping to a random page:
r_j = α Σ_{i: i→j} r_i / L_i + (1 - α) v_j
where L_i is the number of out-links of page i and v is a "preference vector" as a probability distribution: Σ_j v_j = 1.
○ r_j = the probability that page j is read.
○ For the case of v_j = constant, it becomes r_j = α Σ_{i: i→j} r_i / L_i + (1 - α) / N.
● Use the power method to solve for r.
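A minimal sketch of this random-surfer update with uniform v, on a made-up 3-node graph with no sink nodes:

```python
# PageRank power iteration with uniform preference vector v = 1/N:
# r_j <- alpha * sum over in-links of (r_i / L_i) + (1 - alpha) / N.
# Assumes every node has at least one out-link (no sinks).
def pagerank(out_links, alpha=0.85, iters=100):
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        r_new = [(1 - alpha) / n] * n        # random-jump term
        for i, targets in enumerate(out_links):
            share = alpha * r[i] / len(targets)
            for j in targets:
                r_new[j] += share            # follow-a-link term
        r = r_new
    return r

# 3-node cycle with an extra link 0 -> 2: node 2 collects the most rank.
out_links = [[1, 2], [2], [0]]
r = pagerank(out_links)
```

Since the update conserves total probability, the ranks always sum to 1.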
38. Without sinks and sources
[Figure: the 8-node test graph without sinks and sources, using α = 0.85; PageRank × 100 over the nodes: 7.3, 3.4, 25.3, 3.4, 4.2, 12.7, 22.0, 23.1.]
39. The rank sink problem
● The sink is still a problem, although convergence is
reached and non-zero values are obtained. Why?
● Re-write the PageRank equation in eigenvalue form:
r = P r, where P = α W^T + (1 − α) v e^T.
● Is P a column-stochastic matrix?
● Wait... the i-th column of W^T is all zeros if node i is a sink!
40. The Google matrix
● Define
G = α (W^T + e a^T) + (1 − α) v e^T,
where a_j = 1/N if node j is a sink node, 0 otherwise, and
e is a column vector of all 1's.
● Now G is a column-stochastic matrix.
● The final PageRank equation is r = G r.
○ PageRank → the Pinski–Narin influence as α → 1.
○ PageRank → a uniform constant as α → 0.
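A minimal sketch of the Google-matrix construction, using a small made-up link graph with one sink (the graph and all node ids here are illustrative):

```python
import numpy as np

def google_matrix(links, N, alpha=0.85, v=None):
    """Build G = alpha*(W^T + e a^T) + (1 - alpha)*v e^T.

    links: dict node -> list of out-linked nodes (0-based ids).
    a_j = 1/N if node j is a sink, else 0, so every column of G sums to 1.
    """
    if v is None:
        v = np.full(N, 1.0 / N)            # uniform preference vector
    WT = np.zeros((N, N))
    a = np.zeros(N)
    for i in range(N):
        out = links.get(i, [])
        if out:
            for j in out:
                WT[j, i] = 1.0 / len(out)  # column i of W^T
        else:
            a[i] = 1.0 / N                 # node i is a sink
    e = np.ones(N)
    return alpha * (WT + np.outer(e, a)) + (1 - alpha) * np.outer(v, e)

# Illustrative 4-page graph; page 3 is a sink.
links = {0: [1, 2], 1: [2], 2: [0, 3]}
G = google_matrix(links, N=4)
print(G.sum(axis=0))   # every column sums to 1
```

PageRank is then obtained by power iteration, r_k = G r_{k-1}, exactly as on slide 31.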
41. PageRank from the Google matrix
[Figure: the full 8-node example graph, sinks and sources included;
α = 0.85, v = e/N; PageRank × 100: 5.3, 6.5, 19.2, 6.5, 7.8, 13.5, 32.5, 10.1]
42. Discussion on PageRank
● Is PageRank guaranteed to converge?
● What's a reasonable value of α? How sensitive are the
ranks with respect to it?
● How to calculate it for the web graph?
● Independent of query – need to add query-dependent
rankings for real world applications.
43. Convergence of PageRank Calculation
● The speed of convergence of the power method is governed
by the ratio |λ₂/λ₁|.
○ It can be proven that |λ₂| ≤ α for the Google matrix.
○ Brin and Page used 50 to 100 iterations to converge;
0.85^50 ≈ 3 × 10⁻⁴.
● Choosing α close to 1 emphasizes the link structure of the
web, but the convergence will be much slower.
● Choosing α close to 0 converges quickly, but the result is
close to uniform ranks and useless.
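The arithmetic on this slide is easy to check: the error of the power iteration shrinks roughly like α^k, so with α = 0.85:

```python
# Error decay of the power method for the Google matrix: the convergence
# rate is |lambda_2 / lambda_1| <= alpha, so after k iterations the
# residual shrinks by roughly alpha**k.
alpha = 0.85
for k in (50, 100):
    print(k, alpha ** k)
# alpha**50 is about 3e-4, matching the slide
```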
44. Stability of PageRank
● Evaluate the change of the principal eigenvector when the
Google matrix is perturbed by changing α or the link
structure.
● The condition number of G is shown by Kamvar and
Haveliwala to be
κ(G) = (1 + α) / (1 − α).
○ Implications:
■ PageRank is not stable for α close to 1.
■ PageRank is reasonably stable for α < 0.9.
45. More meanings of the α parameter
● Revisit the PageRank equation:
r = α W^T r + (1 − α) v = (1 − α)(I − α W^T)⁻¹ v.
● With uniform v = e/N, it becomes
r = (1 − α)/N · Σ_{k≥0} α^k (W^T)^k e.
Compare with Katz's path-counting index:
r_Katz = Σ_{k≥1} α^k (A^T)^k e.
The α constant plays the role of damping the influence along a
path. That's why it is called the “damping factor” in the later
literature.
○ W vs. A: W has a branching factor 1/L, a penalty for cycles.
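The series comparison can be checked numerically. This sketch uses an illustrative adjacency matrix (not from the slides) and verifies the closed form (I − αW^T)⁻¹ against the truncated path sum; the Katz damping constant is chosen separately, since it must stay below 1/ρ(A):

```python
import numpy as np

# Illustrative directed graph; A is adjacency, W normalizes rows
# by out-degree (the branching factor 1/L).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)
alpha, N = 0.85, 4
e = np.ones(N)

# PageRank with uniform v: r = (1-alpha)/N * sum_k alpha^k (W^T)^k e
r_closed = (1 - alpha) / N * np.linalg.solve(np.eye(N) - alpha * W.T, e)
r_series = sum((1 - alpha) / N * alpha**k
               * np.linalg.matrix_power(W.T, k) @ e
               for k in range(200))

# Katz index: sum_{k>=1} b^k (A^T)^k e, with b below 1/rho(A)
b = 0.2
katz = np.linalg.solve(np.eye(N) - b * A.T, b * A.T @ e)
print(r_closed, katz)
```

Note the structural difference: the PageRank series damps each step by both α and the branching factor 1/L, while Katz counts raw paths damped by b alone.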
47. Practical considerations
● Storage:
○ The G matrix is dense due to the sink nodes – make use of
the sparse W matrix and the rank-one updates a and v.
○ Store W_ij, a_j, v_j and r_j^(k):
O(10N) + O(N) + 2N = O(N) ~ O(TB)
● Calculation:
○ The power method can easily be parallelized (it had better be!)
○ Decomposition of G matrix is one way
○ Rank forwarding is another
○ Use your creativity!
49. Rank forwarding approach
● Use the equation r^(k) = G r^(k−1).
● Phase 1: outgoing links
○ A master process reads the data and sends the data of one
node to one process.
input = i, out-links j_1, j_2, ..., j_L, v_i, PageRank r_i^(k)
output = key-value pairs (j_1, r_i^(k)/L), …, (j_L, r_i^(k)/L),
(i, a_i + (1 − α) v_i) {key = node id}
● Phase 2: shuffle
○ Send (key, value) pairs with the same key to the same process.
● Phase 3: incoming links
○ Each process calculates one node at a time.
input: (j, v_1), (j, v_2), …
output = (j, Σ v) = (j, r_j^(k)), written to disk
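The three phases can be emulated in plain single-process Python. This is a sketch under stated assumptions: the graph is illustrative, and since the slide elides exactly where α enters each emitted pair, this version folds the damping term and the redistributed sink mass into the reduce step:

```python
from collections import defaultdict

def pagerank_iteration(links, r, alpha=0.85):
    """One rank-forwarding step r^(k) = G r^(k-1), map/shuffle/reduce style.

    links: dict node -> list of out-links; r: dict node -> current rank.
    Sink mass is spread uniformly over all N nodes (illustrative choice).
    """
    N = len(r)
    pairs = []                              # Phase 1: map over outgoing links
    sink_mass = 0.0
    for i, ri in r.items():
        out = links.get(i, [])
        if out:
            for j in out:
                pairs.append((j, alpha * ri / len(out)))
        else:
            sink_mass += ri                 # sinks forward rank to everyone
    grouped = defaultdict(float)            # Phase 2: shuffle = group by key
    for j, val in pairs:
        grouped[j] += val                   # Phase 3: reduce incoming links
    return {j: grouped[j] + alpha * sink_mass / N + (1 - alpha) / N
            for j in r}

links = {0: [1, 2], 1: [2], 2: [0, 3]}     # page 3 is a sink
r = {i: 0.25 for i in range(4)}
for _ in range(100):
    r = pagerank_iteration(links, r)
print(r)
```

Each step preserves Σ_j r_j = 1, so the iterate stays a probability distribution, exactly as the Markov-chain view requires.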
50. MapReduce
● The 3-phase calculation is applicable to various massive
data processing tasks
○ Read a lot of data
○ Map: extract something you care about from each record
○ Shuffle and Sort
○ Reduce: aggregate, summarize, filter, or transform
○ Write the results
● Google developed a framework around this paradigm of
parallel computing, called MapReduce.
○ Data analysts write a mapper function and a reducer function
○ Specify the input data, and the framework takes care of the
rest
52. MapReduce scheduling
● One master, many workers
○ Input data split into M map tasks (typically 64 MB in size)
○ Reduce phase partitioned into R reduce tasks
○ Tasks are assigned to workers dynamically
○ Often: M=200,000; R=4,000; workers=2,000
● Master assigns each map task to a free worker
○ Considers locality of data to worker when assigning task
○ Worker reads task input (often from local disk!) to produce R
local files containing intermediate k/v pairs
● Master assigns each reduce task to a free worker
○ Worker reads intermediate k/v pairs from map workers
○ Worker sorts & applies user’s Reduce function to produce the
output
53. Some features of MapReduce
● Backup tasks can be spawned near the end of a job –
don't get stuck on hung tasks.
● Mappers often read from local disk – schedule the process
close to the data to reduce network load.
● Robust fault tolerance:
○ Detect worker failure via periodic heartbeats
○ Re-execute map tasks and/or reduce tasks as necessary
○ Master state is checkpointed to Google File System: new
master recovers & continues
○ Lost 1600 of 1800 machines once, but finished fine.
54. Some published usage statistics
                                 Aug '04   Mar '05   Mar '06
Number of jobs                    29,423    72,229   171,834
Average completion time (secs)       634       934       874
Machine years used                   217       981     2,002
Input data read (TB)               3,288    12,571    52,254
Intermediate data (TB)               758     2,756     6,743
Output data written (TB)             193       941     2,970
Average worker machines              157       232       268
Average worker deaths per job        1.2       1.9       5.0
Average map tasks per job          3,351     3,097     3,836
Average reduce tasks per job          55       144       147
Unique map/reduce combinations       426       411     2,345
56. IF vs. PR by Bollen et al. (2006)
● Formed a journal citation graph from the 2003 ISI Journal
Citation Reports data set; 5,710 journals have non-zero
citations from 2002 and 2001.
● Calculated the Impact Factor and PageRank, and multiplied
them as the “Y-factor”.
57. IF vs. PR in Physics
Does the Impact Factor result make sense to you?
58. Issues of citation counting
● Some of the most cited journals publish review articles or
data tables. They are popular and of utility value, but counting
such citations toward impact doesn't make much sense.
● Journal editors/referees may encourage authors to cite
their journals to increase impact factor (Smith 1997).
○ Spamming the citation count “at no cost”.
● Length bias: “communications” and “letters” type of journals
may cite less due to length limit.
● Many other arguments can be found in the literature.
59. Conclusions
● PageRank goes beyond link counting and forwards
standing/influence/impact of each node to the nodes it links
to.
● PageRank vector as the stationary probability vector of the
Markov chain specified by the Google matrix has many
nice properties and has been extensively studied.
● It has been applied to journal status in the literature, but didn't
catch on – probably because the calculation is not as simple
as the impact factor and the h-index.
○ But it is actually very easy to implement for small networks!
● Maybe you can give it a try on your own network!
60. References
● Page, Brin, Motwani, and Winograd, “The PageRank
Citation Ranking: Bringing Order to the Web,” Technical
Report. Stanford InfoLab (1998).
● L. Katz, “A new status index derived from sociometric
analysis,” Psychometrika, 18(1953), pp. 39–43.
● E. Garfield, “Citation analysis as a tool in journal
evaluation,” Science, 178(1972), pp. 471–479.
● G. Pinski, F. Narin, “Citation influence for journal
aggregates of scientific publications: Theory, with
application to the literature of physics,” Inf. Proc. and
Management, 12(1976), pp. 297–312.
● N. Geller, “On the citation influence methodology of Pinski
and Narin,” Inf. Proc. and Management, 14(1978), pp. 93–
95.
61. References (II)
● Amy N. Langville and Carl D. Meyer, “Google's PageRank
and Beyond: The Science of Search Engine Rankings”,
Princeton Press (2006).
● Jon Kleinberg, “Authoritative Sources in a Hyperlinked
Environment,” J. ACM Vol. 46, 5 (1999): 604–632.
● Ng, Zheng and Jordan, “Link analysis, eigenvectors and
stability”, IJCAI 17, 1: 903-910 (2001).
● Barabási and Albert, "Emergence of scaling in random
networks", Science, 286:509-512, October 15, 1999.
● Albert, Jeong and Barabási, "The Diameter of the WWW”,
Nature 401: 130-131 (1999).
● Bollen, Rodriguez and Van de Sompel, “Journal status”,
Scientometrics 69, 3: 669-687 (2006).
62. References (III)
● Seglen, “Why the impact factor of journals should not be
used for evaluating research”, BMJ 1997;314:497.
● Smith, R., “Journal accused of manipulating impact factor”,
BMJ 1997;314:7079.