Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
1. Prof. Pier Luca Lanzi
Graph Mining
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
2. Prof. Pier Luca Lanzi
References
• Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive
Datasets, Chapter 5 & Chapter 10
• Book and slides are available from http://www.mmds.org
3. Prof. Pier Luca Lanzi
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
4. Prof. Pier Luca Lanzi
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
5. Prof. Pier Luca Lanzi
Citation networks and Maps of science
[Börner et al., 2012]
6. Prof. Pier Luca Lanzi
[Figure: example Web pages as nodes (“I teach a class on Networks. CS224W: Classes are in the Gates building”, “Computer Science Department at Stanford”, “Stanford University”) connected by hyperlinks]
Web as a graph: pages are nodes, edges are links
7. Prof. Pier Luca Lanzi
Web as a graph: pages are nodes, edges are links
8. Prof. Pier Luca Lanzi
How is the Web Organized?
• Initial approaches
  - Human-curated Web directories
  - Yahoo, DMOZ, LookSmart
• Then, Web search
  - Information Retrieval investigates: find relevant docs in a small and trusted set
  - Newspaper articles, patents, etc.
Web is huge, full of untrusted documents, random things, web spam, etc.
9. Prof. Pier Luca Lanzi
Web Search Challenges
• Web contains many sources of information
  - Who should we “trust”?
  - Trick: Trustworthy pages may point to each other!
• What is the “best” answer to query “newspaper”?
  - No single right answer
  - Trick: Pages that actually know about newspapers might all be pointing to many newspapers
11. Prof. Pier Luca Lanzi
PageRank Algorithm
• The underlying idea is to look at links as votes
• A page is more important if it has more links
  - In-coming links? Out-going links?
• Intuition
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has one in-link
• Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
12. Prof. Pier Luca Lanzi
[Figure: example graph annotated with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five remaining nodes at 1.6 each]
13. Prof. Pier Luca Lanzi
Simple Recursive Formulation
• Each link’s vote is proportional to the importance of its source page
• If page j with importance r_j has n out-links, each link gets r_j/n votes
• Page j’s own importance is the sum of the votes on its in-links
[Figure: page j receives a vote r_i/3 from page i and a vote r_k/4 from page k, and passes r_j/3 along each of its three out-links]
$r_j = r_i/3 + r_k/4$
14. Prof. Pier Luca Lanzi
The “Flow” Model
• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
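• Define a “rank” r_j for page j
  $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
  where d_i is the out-degree of node i
• “Flow” equations
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
[Figure: three-page graph with nodes y, a, and m; y links to itself and to a (y/2 each), a links to y and to m (a/2 each), m links to a]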
15. Prof. Pier Luca Lanzi
Solving the Flow Equations
• Three equations, three unknowns, no constants
  - No unique solution
  - All solutions are equivalent modulo the scale factor
• An additional constraint (r_y + r_a + r_m = 1) forces uniqueness
• Gaussian elimination works for small examples, but we need a better method for large web-size graphs
We need a different formulation that scales up!
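As a minimal sketch of the small-example approach, the flow equations for the y/a/m graph can be solved by Gaussian elimination once the normalization constraint replaces one of the (redundant) flow equations. The helper name `solve` and the choice to drop the equation for r_a are assumptions made for illustration; exact fractions are used only to keep the result readable.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b].
    rows = [[Fraction(x) for x in A[i]] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row, then eliminate this column from all other rows.
        p = rows[col][col]
        rows[col] = [x / p for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Unknowns are ordered as (r_y, r_a, r_m).
half = Fraction(1, 2)
A = [
    [half, -half, 0],  # r_y = r_y/2 + r_a/2  ->  r_y/2 - r_a/2 = 0
    [0, -half, 1],     # r_m = r_a/2          ->  -r_a/2 + r_m = 0
    [1, 1, 1],         # normalization: r_y + r_a + r_m = 1
]
b = [0, 0, 1]

r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

The dropped equation r_a = r_y/2 + r_m is satisfied by this solution as well (2/5 = 1/5 + 1/5). Elimination costs on the order of N^3 operations for N pages, which is why it does not scale to web-size graphs.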
16. Prof. Pier Luca Lanzi
The Matrix Formulation
• Represent the graph as a transition matrix M
  - Suppose page i has d_i out-links
  - If page i links to page j, M_ji is set to 1/d_i, else M_ji = 0
  - M is a “column stochastic matrix” since the columns sum up to 1
• Given the rank vector r with an entry per page, where r_i is the importance of page i and the r_i sum up to one
  $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
The flow equation can be written as r = Mr
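The construction of M can be sketched for the three-page y/a/m example as follows. The function names `transition_matrix` and `mat_vec` and the dictionary-of-out-links input format are assumptions for illustration, and the sketch assumes every page has at least one out-link.

```python
def transition_matrix(out_links, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links to page j."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for src, targets in out_links.items():
        d = len(targets)  # out-degree d_i of the source page
        for dst in targets:
            M[index[dst]][index[src]] = 1.0 / d
    return M

def mat_vec(M, r):
    """Compute the matrix-vector product M r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = transition_matrix(out_links, pages)
column_sums = [sum(M[j][i] for j in range(len(pages))) for i in range(len(pages))]
print(column_sums)  # each column sums to 1, so M is column stochastic

# The rank vector (2/5, 2/5, 1/5) that solves the flow equations satisfies r = Mr.
r = [0.4, 0.4, 0.2]
print(mat_vec(M, r))
```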
17. Prof. Pier Luca Lanzi
The Eigenvector Formulation
• Since the flow equation can be written as r = Mr, the rank vector r is also an eigenvector of M (with eigenvalue 1)
• Thus, we can solve for r using a simple iterative scheme (“power iteration”)
• Power iteration: a simple iterative scheme
  - Suppose there are N web pages
  - Initialize: r^(0) = [1/N, …, 1/N]^T
  - Iterate: r^(t+1) = M r^(t)
  - Stop when |r^(t+1) − r^(t)|_1 < ε
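A minimal sketch of power iteration, applied to the transition matrix of the three-page y/a/m example. The function name, the default tolerance, and the iteration cap `max_iter` are assumptions added for illustration.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix for pages (y, a, m): column i holds the out-links of page i.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

rank = power_iteration(M)
print([round(x, 4) for x in rank])  # approximately [0.4, 0.4, 0.2]
```

Each iteration is a single matrix-vector product, so for a sparse web graph the cost per step is proportional to the number of links rather than to N^3.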
18. Prof. Pier Luca Lanzi
The Random Walk Formulation
• Suppose that a random surfer is on page i at time t and continues its navigation by following one of the out-links of page i at random
• At time t+1, the surfer ends up on some page j and from there continues the random surfing indefinitely
• Let p(t) be the vector of probabilities p_i(t) that the surfer is on page i at time t (p(t) is a probability distribution over pages)
• Then, p(t+1) = M p(t); when the walk reaches a state where p(t+1) = M p(t) = p(t),
p(t) is the stationary distribution for the random walk
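The random-surfer view can be checked empirically with a short simulation: the fraction of time the surfer spends on each page should approach the stationary distribution. This is only a sketch; the function name, the fixed seed, and the number of steps are assumptions, and the graph is the three-page y/a/m example.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow random out-links for `steps` moves and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])  # pick one out-link uniformly at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", steps=200_000)
print(freq)  # close to y: 0.4, a: 0.4, m: 0.2
```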
19. Prof. Pier Luca Lanzi
Existence and Uniqueness
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is
21. Prof. Pier Luca Lanzi
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)
  - Is a measure of importance of pages and documents, similar to PageRank
  - Proposed at around the same time as PageRank (1998)
• Goal: Say we want to find good newspapers
  - Don’t just find newspapers. Find “experts”, that is, people who link in a coordinated way to good newspapers
• The idea is similar: links are viewed as votes
  - A page is more important if it has more links
  - In-coming links? Out-going links?
22. Prof. Pier Luca Lanzi
Hubs and Authorities
• Each page has 2 scores
• Quality as an expert (hub)
  - Total sum of votes of authorities pointed to
• Quality as content (authority)
  - Total sum of votes coming from experts
• Principle of repeated improvement
23. Prof. Pier Luca Lanzi
Hubs and Authorities
• Authorities are pages containing useful information
  - Newspaper home pages
  - Course home pages
  - Home pages of auto manufacturers
• Hubs are pages that link to authorities
  - List of newspapers
  - Course bulletin
  - List of US auto manufacturers
24. Prof. Pier Luca Lanzi
Counting in-links: Authority
(Note that this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Each page starts with hub score 1. Authorities collect their votes.
25. Prof. Pier Luca Lanzi
Counting in-links: Authority
Sum of hub scores of nodes pointing to NYT.
Each page starts with hub score 1. Authorities collect their votes.
26. Prof. Pier Luca Lanzi
Expert Quality: Hub
Hubs collect authority scores
Sum of authority scores of nodes that the node points to.
27. Prof. Pier Luca Lanzi
Reweighting
Authorities again collect the hub scores
28. Prof. Pier Luca Lanzi
Mutually Recursive Definition
• A good hub links to many good authorities
• A good authority is linked from many good hubs
• Model using two scores for each node:
  - Hub score and Authority score
  - Represented as vectors h and a
29. Prof. Pier Luca Lanzi
The HITS Algorithm
• Initialize scores
• Iterate until convergence:
  - Update authority scores
  - Update hub scores
  - Normalize
• Two vectors a = (a_1, …, a_n) and h = (h_1, …, h_n) and the adjacency matrix A, with A_ij = 1 if i links to j, 0 otherwise
30. Prof. Pier Luca Lanzi
The HITS Algorithm (vector notation)
• Set a_i = h_i = 1/√n
• Repeat until convergence
  - h = A a
  - a = A^T h
• Convergence criteria: [equations missing in the source]
• Under reasonable assumptions about A, HITS converges to vectors h* and a* where
  - h* is the principal eigenvector of matrix A A^T
  - a* is the principal eigenvector of matrix A^T A
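The update rules in vector notation can be sketched as below. The stopping rule (sum of squared changes in h and a below a threshold), the unit-length normalization, and the five-page example graph are assumptions chosen for illustration, since the convergence criteria are not recoverable from the slide text.

```python
import math

def hits(A, eps=1e-12, max_iter=1000):
    """Iterate h = A a, a = A^T h with normalization; return (h, a)."""
    n = len(A)
    a = [1.0 / math.sqrt(n)] * n
    h = [1.0 / math.sqrt(n)] * n

    def normalize(v):
        # Rescale to unit Euclidean length; leave an all-zero vector unchanged.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else v

    for _ in range(max_iter):
        # Hub update: sum the authority scores of the pages each page points to.
        h_new = normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
        # Authority update: sum the hub scores of the pages pointing to each page.
        a_new = normalize([sum(A[i][j] * h_new[i] for i in range(n)) for j in range(n)])
        # Assumed stopping rule: total squared change below eps.
        change = sum((x - y) ** 2 for x, y in zip(h_new, h))
        change += sum((x - y) ** 2 for x, y in zip(a_new, a))
        h, a = h_new, a_new
        if change < eps:
            break
    return h, a

# Hypothetical graph: pages 0 and 1 act as hubs, pages 2, 3, 4 as authorities.
# Page 0 links to 2 and 3; page 1 links to 2, 3, and 4.
A = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

h, a = hits(A)
print("hub scores:      ", [round(x, 3) for x in h])
print("authority scores:", [round(x, 3) for x in a])
```

In this example page 1 ends up with the higher hub score because it links to every authority, and pages 2 and 3 outrank page 4 because both hubs point to them.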
31. Prof. Pier Luca Lanzi
PageRank vs HITS
• PageRank and HITS are two solutions to the same problem
  - What is the value of an in-link from u to v?
  - In the PageRank model, the value of the link depends on the links into u
  - In the HITS model, it depends on the value of the other links out of u
• The destinies of PageRank and HITS after 1998 were very different
39. Prof. Pier Luca Lanzi
Girvan-Newman Method
• Define edge betweenness as the number of shortest paths passing over the edge
• Divisive hierarchical clustering based on the notion of edge betweenness
• The Algorithm
  - Start with an undirected graph
  - Repeat until no edges are left:
    - Calculate betweenness of edges
    - Remove edges with highest betweenness
• Connected components are communities
• Gives a hierarchical decomposition of the network
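The loop above can be sketched as follows. This is an illustrative version under stated assumptions: the graph is unweighted, undirected, and has no self-loops; betweenness is computed with a Brandes-style BFS accumulation (one common way to do it, not necessarily the one used in the lecture); and the helper names `edge_betweenness`, `components`, and `girvan_newman` are made up for the example.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (adj: node -> set of neighbours)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path credit from the farthest nodes back toward s
        credit = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Every undirected path was counted once from each endpoint
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(edges):
    """Return the partitions seen each time the number of components grows."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels, n_comps = [], len(components(adj))
    while any(adj.values()):
        bet = edge_betweenness(adj)          # recomputed at every step
        top = max(bet.values())
        for e, b in bet.items():
            if abs(b - top) < 1e-9:          # remove all edges tied for the maximum
                u, v = tuple(e)
                adj[u].discard(v)
                adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n_comps:
            n_comps = len(comps)
            levels.append(comps)
    return levels

# Two triangles joined by the bridge 3-4 (hypothetical example graph)
levels = girvan_newman([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])
```

On this example the bridge carries the most shortest paths, so it is removed first and the first level of the hierarchy is the pair of triangles.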
40. Prof. Pier Luca Lanzi
Need to re-compute betweenness at every step.
[Figure: example graph with edge betweenness values 49, 33, 121]
44. Prof. Pier Luca Lanzi
Network Communities
• Communities are viewed as sets of tightly connected nodes
• We define modularity as a measure of how well a network is partitioned into communities
• Given a partitioning of the network into a set of groups S, we define the modularity Q as
Need a null model!
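A standard way to write this definition is given below. The first line is the informal form; the second assumes the usual configuration-model null model, in which the expected number of edges between nodes i and j is k_i k_j / 2m. The symbols m (number of edges), k_i (degree of node i), and A_ij (adjacency matrix entry) are introduced here for illustration and are assumptions, not notation taken from the slide.

```latex
Q \;\propto\; \sum_{s \in S} \Big[ (\text{\# edges within group } s) - (\text{expected \# edges within group } s) \Big]

Q(G, S) \;=\; \frac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s} \left( A_{ij} - \frac{k_i k_j}{2m} \right)
```

With this normalization Q lies between -1 and 1, and larger values indicate more within-group edges than the null model predicts.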
45. Prof. Pier Luca Lanzi
Modularity is useful for selecting the number of clusters:
[Figure: plot of modularity Q]
47. Prof. Pier Luca Lanzi
What Makes a Good Cluster?
• Undirected graph G(V,E)
• Partitioning task
  - Divide the vertices into two disjoint groups A, B = V \ A
• Questions
  - How can we define a “good partition” of G?
  - How can we efficiently identify such a partition?
[Figure: example graph with nodes 1 to 6, shown unpartitioned and partitioned into groups A and B]
48. Prof. Pier Luca Lanzi
[Figure: example graph with nodes 1 to 6]
What makes a good partition?
Maximize the number of within-group connections
Minimize the number of between-group connections
49. Prof. Pier Luca Lanzi
Graph Cuts
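• Express partitioning objectives as a function of the “edge cut” of the partition
• Cut is defined as the set of edges with only one vertex in a group
• The cut of the sets A, B in the example is cut(A,B) = 2, or more generally $\mathrm{cut}(A,B) = \sum_{i \in A,\, j \in B} w_{ij}$, where $w_{ij}$ is the weight of edge (i,j) (1 for an unweighted graph)
[Figure: example graph with nodes 1 to 6 split into groups A and B]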
50. Prof. Pier Luca Lanzi
Graph Cut Criterion
• Partition quality
  - Minimize weight of connections between groups, i.e., arg min_{A,B} cut(A,B)
• Degenerate case:
[Figure: graph contrasting the “optimal cut” with the minimum cut]
• Problems
  - Only considers external cluster connections
  - Does not consider internal cluster connectivity
51. Prof. Pier Luca Lanzi
Graph Partitioning Criteria: Normalized cut (Conductance)
• Connectivity of the group to the rest of the network should be relative to the density of the group
• Where vol(A) is the total weight of the edges that have at least one endpoint in A
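A small sketch comparing the plain cut with a normalized cut. Two assumptions are made here: the criterion is taken to be the usual form ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B), and the edge list is read off the 6-node Laplacian matrix that appears later in these slides (whether it is the same graph as in the figures is not confirmed). The function names are hypothetical; `vol` follows the definition on the slide (edges with at least one endpoint in the group, unit weights).

```python
# Edge list assumed from the 6-node Laplacian shown later in the slides
EDGES = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def cut(edges, group):
    """Number of edges with exactly one endpoint in group (unit weights)."""
    return sum(1 for u, v in edges if (u in group) != (v in group))

def vol(edges, group):
    """Number of edges with at least one endpoint in group (unit weights)."""
    return sum(1 for u, v in edges if u in group or v in group)

def normalized_cut(edges, group_a, group_b):
    """cut(A,B)/vol(A) + cut(A,B)/vol(B), the assumed normalized-cut form."""
    c = cut(edges, group_a)
    return c / vol(edges, group_a) + c / vol(edges, group_b)

balanced = normalized_cut(EDGES, {1, 2, 3}, {4, 5, 6})
lopsided = normalized_cut(EDGES, {2}, {1, 3, 4, 5, 6})
```

Both partitions cut two edges, so the plain cut criterion cannot tell them apart, while the normalized cut is lower for the balanced split because each side has more internal volume.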
54. Prof. Pier Luca Lanzi
Spectral Graph Partitioning
• Let A be the adjacency matrix of the graph G with n nodes
  - Aij is 1 if there is an edge between i and j, 0 otherwise
  - x is a vector of n components (x1, …, xn) that represents labels/values assigned to each node of G
  - Ax returns a vector in which each component j is the sum of the labels of the neighbors of node j
• Spectral Graph Theory
  - Analyze the spectrum of G, that is, the eigenvectors xi of the graph corresponding to the eigenvalues Λ of G sorted in increasing order
  - Λ = {λ1, …, λn} such that λ1 ≤ λ2 ≤ … ≤ λn
55. Prof. Pier Luca Lanzi
Example: d-regular Graph
• Suppose that all the nodes in G have degree d and G is connected
• What are the eigenvalues/eigenvectors of G? Ax = λx
  - Ax returns the sum of the labels of each node’s neighbors and, since each node has exactly d neighbors, x = (1, …, 1) is an eigenvector and d is an eigenvalue
• What if G is not connected but still d-regular?
• A vector with all ones in A and all zeros in B (or vice versa) is still an eigenvector of the adjacency matrix with eigenvalue d
[Figure: two disconnected components A and B]
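Both claims are easy to check numerically. The sketch below uses cycles as a convenient family of 2-regular graphs; the helper names `mat_vec`, `cycle`, and `disjoint_union` are made up for this example.

```python
def mat_vec(adj, x):
    """Multiply an adjacency matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in adj]

def cycle(n):
    """Adjacency matrix of an n-cycle (n >= 3), a connected 2-regular graph."""
    return [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)]
            for i in range(n)]

def disjoint_union(a, b):
    """Block-diagonal adjacency matrix of two graphs with no edges between them."""
    na, nb = len(a), len(b)
    return [row + [0] * nb for row in a] + [[0] * na + row for row in b]

# Connected case: A x = d x for x = (1, ..., 1), here with d = 2
ones = [1] * 5
connected = mat_vec(cycle(5), ones)

# Disconnected case: the indicator vector of component A is also an eigenvector
g = disjoint_union(cycle(3), cycle(4))
indicator_a = [1, 1, 1, 0, 0, 0, 0]
disconnected = mat_vec(g, indicator_a)
```

In the disconnected case the indicator of B works the same way, so the eigenvalue d has (at least) two independent eigenvectors, which is the intuition the next slide builds on.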
56. Prof. Pier Luca Lanzi
Example: d-regular Graph (not connected)
• What if G has two separate components but is still d-regular?
• A vector with all ones in A and all zeros in B (or vice versa) is still an eigenvector of the adjacency matrix with eigenvalue d
• Underlying intuition
[Figure: components A and B shown disconnected (λ1 = λ2) and weakly connected (λ1 ≈ λ2)]
58. Prof. Pier Luca Lanzi
Graph Laplacian Matrix
• Computed as L = D - A (D is the diagonal degree matrix, A the adjacency matrix)
  - n×n symmetric matrix
  - x = (1, …, 1) is a trivial eigenvector since Lx = 0, so λ1 = 0
• Important properties of L
  - Eigenvalues are non-negative real numbers
  - Eigenvectors are real and orthogonal

Laplacian matrix L of the example graph:
     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2
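As a quick check, the matrix above can be rebuilt from an edge list. The edge list is read off the off-diagonal -1 entries of the matrix; the function name `laplacian` is an illustrative choice.

```python
# Edges implied by the -1 entries of the Laplacian shown above
EDGES = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def laplacian(n, edges):
    """L = D - A for an undirected, unweighted graph on nodes 1..n."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][j] -= 1      # -A: off-diagonal entries
        L[j][i] -= 1
        L[i][i] += 1      # D: each edge adds one to both endpoint degrees
        L[j][j] += 1
    return L

L = laplacian(6, EDGES)
row_sums = [sum(row) for row in L]   # all zeros, i.e. L (1, ..., 1) = 0
```

Every row sums to zero because the degree on the diagonal is cancelled by one -1 per neighbor, which is why (1, …, 1) is an eigenvector with eigenvalue 0.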
59. Prof. Pier Luca Lanzi
λ2 as optimization problem
• For symmetric matrix M,
• What is the meaning of x^T L x on G? We can show that,
• So that, considering that the second eigenvector x is a unit vector, and x is orthogonal to the vector (1, …, 1)
60. Prof. Pier Luca Lanzi
λ2 as optimization problem
• So that, considering that the second eigenvector x is a unit vector, and x is orthogonal to the vector (1, …, 1)
• Such that,
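The standard results these two slides appear to rely on can be written out as follows. This is a reconstruction under assumptions: the Rayleigh-quotient characterization is stated for the Laplacian L (whose first eigenvector is (1, …, 1)), d_i denotes the degree of node i, and E is the edge set, with each undirected edge counted once.

```latex
% Second-smallest eigenvalue as a constrained Rayleigh quotient
\lambda_2 \;=\; \min_{x \neq 0,\; x \perp (1,\dots,1)} \frac{x^{T} L x}{x^{T} x}

% Meaning of the quadratic form, using L = D - A
x^{T} L x \;=\; \sum_{i} d_i x_i^{2} \;-\; \sum_{(i,j) \in E} 2\, x_i x_j
          \;=\; \sum_{(i,j) \in E} (x_i - x_j)^{2}

% Unit length means \sum_i x_i^2 = 1; orthogonality to (1,...,1) means \sum_i x_i = 0
\lambda_2 \;=\; \min_{\sum_i x_i = 0,\;\; \sum_i x_i^{2} = 1} \;\; \sum_{(i,j) \in E} (x_i - x_j)^{2}
```

The middle identity holds because each edge (i,j) contributes x_i^2 + x_j^2 to the degree term and -2 x_i x_j to the adjacency term.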
61. Prof. Pier Luca Lanzi
λ2 and its eigenvector x balance to minimize
[Figure: nodes placed along an axis x around 0, with values xi and xj]
62. Prof. Pier Luca Lanzi
Finding the Optimal Cut
• Express the partition (A,B) as a vector y where,
  - yi = +1 if node i belongs to A
  - yi = -1 if node i belongs to B
• We can minimize the cut of the partition by finding a non-trivial vector that minimizes
Can’t solve exactly! Let’s relax y and allow y to take any real value.
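One way to see why this objective measures the cut, assuming the function being minimized has the usual form f(y) with one squared difference per edge (an assumption about the formula on the slide):

```latex
f(y) \;=\; \sum_{(i,j) \in E} (y_i - y_j)^{2} \;=\; y^{T} L y

% For labels y_i \in \{+1, -1\}:
(y_i - y_j)^{2} \;=\;
\begin{cases}
4 & \text{if edge } (i,j) \text{ crosses between } A \text{ and } B \\
0 & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
f(y) \;=\; 4 \cdot \mathrm{cut}(A,B)
```

Minimizing f over ±1 labels is therefore the same as minimizing the cut; the relaxation keeps the objective but lets each y_i be any real number.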
63. Prof. Pier Luca Lanzi
Rayleigh Theorem
• We know that,
• The minimum value of f(y) is given by the second smallest eigenvalue λ2 of the Laplacian matrix L
• Thus, the optimal solution for y is given by the corresponding eigenvector x, referred to as the Fiedler vector
64. Prof. Pier Luca Lanzi
Spectral Clustering Algorithms
1. Pre-processing
  - Construct a matrix representation of the graph
2. Decomposition
  - Compute eigenvalues and eigenvectors of the matrix
  - Map each point to a lower-dimensional representation based on one or more eigenvectors
3. Grouping
  - Assign points to two or more clusters, based on the new representation
65. Prof. Pier Luca Lanzi
Spectral Partitioning Algorithm
• Pre-processing:
  - Build Laplacian matrix L of the graph
• Decomposition:
  - Find eigenvalues λ and eigenvectors x of the matrix L
  - Map vertices to corresponding components of the eigenvector x2 for λ2

Laplacian matrix L:
     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

λ = (5.0, 4.0, 3.0, 3.0, 1.0, 0.0)

X =
 0.0  -0.4  -0.4   0.4  -0.6   0.4
 0.5   0.4  -0.2  -0.5  -0.3   0.4
-0.5   0.4   0.6   0.1  -0.3   0.4
 0.5  -0.4   0.6   0.1   0.3   0.4
 0.0   0.4  -0.4   0.4   0.6   0.4
-0.5  -0.4  -0.2  -0.5   0.3   0.4

Components of x2: -0.66, -0.35, -0.34, 0.33, 0.62, 0.31
66. Prof. Pier Luca Lanzi
Spectral Partitioning Algorithm
• Grouping:
  - Sort components of reduced 1-dimensional vector
  - Identify clusters by splitting the sorted vector in two
• How to choose a splitting point?
  - Naïve approaches: split at 0 or median value
  - More expensive approaches: attempt to minimize normalized cut in 1 dimension (sweep over ordering of nodes induced by the eigenvector)

Components of x2: -0.66, -0.35, -0.34, 0.33, 0.62, 0.31
Split at 0:
  Cluster A: positive points (0.33, 0.62, 0.31)
  Cluster B: negative points (-0.66, -0.35, -0.34)
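The three steps (build L, find the second eigenvector, split at 0) can be sketched in pure Python. This is an illustration under assumptions, not the method used to produce the numbers on the slides: the Fiedler vector is approximated by power iteration on the shifted matrix cI - L while projecting out (1, …, 1), which works when λ2 is strictly smaller than λ3; the shift c = 2·(max degree) is an upper bound on the largest eigenvalue of L; the edge list is read off the Laplacian matrix; and all function names are hypothetical. The computed values need not match the rounded figures on the slide.

```python
import math

EDGES = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def laplacian(n, edges):
    """L = D - A for an undirected, unweighted graph on nodes 1..n."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L

def fiedler_vector(L, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L."""
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n))   # >= largest eigenvalue of L
    x = [float(i) for i in range(n)]               # arbitrary non-constant start
    for _ in range(iterations):
        # Multiply by (shift * I - L): small eigenvalues of L become dominant
        Lx = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [shift * xi - lxi for xi, lxi in zip(x, Lx)]
        # Remove the component along (1, ..., 1), the trivial eigenvector
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
    return x

def split_at_zero(x):
    """Naive grouping: nodes with non-negative value vs nodes with negative value."""
    positive = {i + 1 for i, xi in enumerate(x) if xi >= 0}
    negative = {i + 1 for i, xi in enumerate(x) if xi < 0}
    return positive, negative

x2 = fiedler_vector(laplacian(6, EDGES))
cluster_a, cluster_b = split_at_zero(x2)
```

For this graph the sign split separates nodes {1, 2, 3} from {4, 5, 6}, the same two groups as the cut of size 2 discussed earlier. A sweep over the sorted components, evaluating the normalized cut at each split point, would be the more expensive alternative mentioned above.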
67. Prof. Pier Luca Lanzi
Example: Spectral Partitioning
[Figure: value of x2 plotted against rank in x2]
68. Prof. Pier Luca Lanzi
Example: Spectral Partitioning
[Figure: value of x2 plotted against rank in x2; components of x2]
69. Prof. Pier Luca Lanzi
Example: Spectral Partitioning
[Figure: components of x1 and components of x3]
70. Prof. Pier Luca Lanzi
How Do We Partition a Graph into k Clusters?
• Two basic approaches:
• Recursive bi-partitioning [Hagen et al., ’92]
  - Recursively apply bi-partitioning algorithm in a hierarchical divisive manner
  - Disadvantages: inefficient, unstable
• Cluster multiple eigenvectors [Shi-Malik, ’00]
  - Build a reduced space from multiple eigenvectors
  - Commonly used in recent papers
  - A preferable approach…
71. Prof. Pier Luca Lanzi
Why Use Multiple Eigenvectors?
• Approximates the optimal cut [Shi-Malik, ’00]
  - Can be used to approximate optimal k-way normalized cut
• Emphasizes cohesive clusters
  - Increases the unevenness in the distribution of the data
  - Associations between similar points are amplified, associations between dissimilar points are attenuated
  - The data begins to “approximate a clustering”
• Well-separated space
  - Transforms data to a new “embedded space”, consisting of k orthogonal basis vectors
• Multiple eigenvectors prevent instability due to information loss
72. Prof. Pier Luca Lanzi
Searching for Small Communities (Trawling)
73. Prof. Pier Luca Lanzi
Searching for small communities in the Web graph (Trawling)
• Trawling
  - What is the signature of a community in a Web graph?
  - The underlying intuition is that small communities involve many people talking about the same things
  - Use this to define “topics”: what do the same people on the left talk about on the right?
• More formally
  - Enumerate complete bipartite subgraphs Ks,t
  - Ks,t has s nodes on the “left” and t nodes on the “right”
  - Every left node links to the same set of nodes on the right, forming a fully connected bipartite graph
[Kumar et al. ’99]
[Figure: dense 2-layer graph; example K3,4 with left set X and right set Y]
74. Prof. Pier Luca Lanzi
Mining Bipartite Ks,t using Frequent Itemsets
• Searching for such complete bipartite graphs can be viewed as a frequent itemset mining problem
• View each node i as a set Si of nodes i points to
• Ks,t = a set Y of size t that occurs in s sets Si
• Looking for Ks,t is equivalent to setting the frequency threshold to s and looking at layer t (i.e., all frequent sets of size t)
[Kumar et al. ’99]
[Figure: node i in X pointing to the nodes Si = {a,b,c,d} in Y]
s = minimum support (|X| = s)
t = itemset size (|Y| = t)
75. Prof. Pier Luca Lanzi
• View each node i as a set Si of nodes i points to, e.g. Si = {a,b,c,d}
• Find frequent itemsets:
  - s … minimum support
  - t … itemset size
• Ks,t = a set Y of size t that occurs in s sets Si
• Say we find a frequent itemset Y = {a,b,c} of support s; then there are s nodes that link to all of {a,b,c}
• We found Ks,t!
[Figure: nodes in X pointing to the same set Y]
76. Prof. Pier Luca Lanzi
Example
• Itemsets
  - a = {b,c,d}
  - b = {d}
  - c = {b,d,e,f}
  - d = {e,f}
  - e = {b,d}
  - f = {}
• Support threshold s = 2
  - {b,d}: support 3
  - {e,f}: support 2
• And we just found 2 bipartite subgraphs:
[Figure: the graph on nodes a to f and the two subgraphs found: {a,c,e} → {b,d} and {c,d} → {e,f}]
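The example can be reproduced with a brute-force search over all candidate sets of size t. This is a sketch for small graphs only: a real trawling pipeline would use a frequent itemset miner such as Apriori rather than enumerating every combination, and the names `OUT_LINKS` and `frequent_itemsets` are made up for illustration.

```python
from itertools import combinations

# Out-link sets from the example: each node is viewed as the set of nodes it points to
OUT_LINKS = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}

def frequent_itemsets(out_links, s, t):
    """Map each set Y of t nodes to the nodes linking to all of Y, keeping support >= s."""
    items = sorted(set().union(*out_links.values()))
    found = {}
    for candidate in combinations(items, t):
        y = set(candidate)
        supporters = {node for node, targets in out_links.items() if y <= targets}
        if len(supporters) >= s:
            # supporters (left side) and y (right side) form a complete bipartite subgraph
            found[frozenset(y)] = supporters
    return found

result = frequent_itemsets(OUT_LINKS, s=2, t=2)
for right, left in result.items():
    print(sorted(left), "->", sorted(right))
```

With s = 2 and t = 2 the search returns exactly the two itemsets listed above: {b,d}, supported by a, c, and e, and {e,f}, supported by c and d.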