Natural Language embodies the human ability to make “infinite use of finite means” (Humboldt, 1836; Chomsky, 1965). A relatively small number of words can be combined using a grammar in myriad ways to convey all kinds of information. Languages model inter-relationships between their words, just as graphs model inter-relationships between their vertices. It is not surprising, then, that graphs are a natural tool for studying Natural Language and gleaning useful information from it, automatically and at scale. This presentation will focus on NLP techniques to convert raw text to graphs, and present Graph Theory based solutions to some common NLP problems. Solutions presented will use Apache Spark or Neo4j depending on problem size and scale. Examples of Graph Theory solutions presented include PageRank for Document Summarization, Link Prediction from raw text for Knowledge Graph enhancement, Label Propagation for entity classification, and Random Walk techniques to find similar documents.
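As a concrete illustration of the PageRank-for-summarization idea mentioned above, here is a minimal single-machine sketch in the TextRank style, assuming TF-IDF cosine similarity between sentences; the talk itself targets Apache Spark or Neo4j at scale, so this toy version is illustrative only.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, num_sentences=3):
    # Nodes are sentences; edge weights are TF-IDF cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)
    # Rank sentences by PageRank centrality over the similarity graph.
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    # Return the top-ranked sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```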
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
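As a hedged illustration of the scoring step described above: assuming `in_emb` and `out_emb` are word-to-vector mappings for the retained input and output projections, one common formulation compares each query word in the IN space against the centroid of the normalized document word vectors in the OUT space.

```python
import numpy as np

def desm_score(query_terms, doc_terms, in_emb, out_emb):
    # Query words are mapped into the IN (input projection) space.
    q = np.stack([in_emb[w] for w in query_terms if w in in_emb])
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    # Document words are mapped into the OUT (output projection) space,
    # normalized, and reduced to a single centroid vector.
    d = np.stack([out_emb[w] for w in doc_terms if w in out_emb])
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    d_bar = d.mean(axis=0)
    d_bar /= np.linalg.norm(d_bar)
    # Average cosine similarity of each query word against the centroid.
    return float(q.dot(d_bar).mean())
```

The final ranking signal can then be a linear mixture, e.g. a score like `lam * desm + (1 - lam) * bm25`, along the lines of the DESM-plus-word-counting combination reported above.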
Adversarial and reinforcement learning-based approaches to information retrieval (Bhaskar Mitra)
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of our recent works in this area that we presented at SIGIR 2018.
5 Lessons Learned from Designing Neural Models for Information Retrieval (Bhaskar Mitra)
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
Exploring Session Context using Distributed Representations of Queries and Reformulations (Bhaskar Mitra)
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both these representations together achieves better performance than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
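To illustrate the offset-vector idea from the abstract, a reformulation can be represented as the difference between the embeddings of the two queries; the sketch below assumes some query encoder `embed(query) -> np.ndarray` (e.g. a CLSM-style model, which is not reproduced here).

```python
import numpy as np

def reformulation_vector(q_before, q_after, embed):
    # A reformulation is the offset between the two query embeddings.
    return embed(q_after) - embed(q_before)

def cosine(a, b):
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar reformulations should have nearby offset vectors,
# e.g. a high value for:
# cosine(reformulation_vector("san francisco", "san francisco 49ers", embed),
#        reformulation_vector("detroit", "detroit lions", embed))
```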
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014) (Konstantinos Zagoris)
H-KWS 2014 is the Handwritten Keyword Spotting Competition organized in conjunction with the ICFHR 2014 conference. The main objective of the competition is to record current advances in keyword spotting algorithms using established performance evaluation measures frequently encountered in the information retrieval literature. The competition comprises two distinct tracks, namely a segmentation-based and a segmentation-free track. Five (5) distinct research groups participated in the competition, with three (3) methods for the segmentation-based track and four (4) methods for the segmentation-free track. The benchmarking datasets used in the contest contain both historical and modern documents from multiple writers. In this paper, the contest details are reported, including the evaluation measures and the performance of the submitted methods, along with a short description of each method.
Automatic Personality Prediction with Attention-based Neural Networks (Jinho Choi)
Distributional Semantic word representation allows Natural Language Processing systems to extract and model an immense amount of information about a language. This technique maps words into a high dimensional continuous space through the use of a single-layer neural network. This process has allowed for advances in many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests, in which questions of the form "If a is to a', then b is to what?" are answered by composing multiple word vectors and searching the vector space. During the neural network training process, each word is examined as a member of its context. Generally, a word's context is considered to be the elements adjacent to it within a sentence. While some work has been conducted examining the effect of expanding this definition, very little exploration has been done in this area. Further, no inquiry has been conducted into the specific linguistic competencies of these models or whether modifying their contexts impacts the information they extract. In this paper we propose a thorough analysis of the various lexical and grammatical competencies of distributional semantic models. We aim to leverage analogy tests to evaluate the most advanced distributional model across 14 different types of linguistic relationships. With this information we will then be able to investigate whether modifying the training context renders any differences in quality across any of these categories. Ideally we will be able to identify approaches to training that increase precision in some specific linguistic categories, which will allow us to investigate whether these improvements can be combined by joining the information used in different training approaches to build a single, improved model.
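For reference, this is how such an analogy test is commonly run with gensim word vectors; the model path is a placeholder.

```python
from gensim.models import KeyedVectors

# Placeholder path to pre-trained word vectors in word2vec format.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# "If man is to king, then woman is to what?" is answered by composing
# vectors and searching the space; a competent model ranks "queen" first.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```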
Deep Learning approaches for Hate speech detection. In this work we used two deep learning approaches, DCNN and MLP, as two separate classifiers on four publicly available datasets.
The session focused on Data Mining using the R language, where I analyzed a large volume of text files to extract meaningful insights using concepts like DocumentTermMatrix and WordCloud.
Author Identification of Source Code Segments Written by Multiple Authors Usi... (Parvez Mahbub)
Authors:
Parvez Mahbub, Department of Computer Science, Dalhousie University (parvezmrobin@dal.ca)
Naz Zarreen Oishie, Department of Computer Science, University of Saskatchewan (naz.oishie@usask.ca)
S.M. Rafizul Haque, CSE Discipline, Khulna University (rafizul@cse.ku.ac.bd)
AOTO: Adaptive overlay topology optimization in unstructured P2P systems (Zhenyun Zhuang)
IEEE GLOBECOM 2003
Peer-to-Peer (P2P) systems are self-organized and decentralized. However, the mechanism of a peer randomly joining and leaving a P2P network causes topology mismatching between the P2P logical overlay network and the physical underlying network. The topology mismatching problem brings great stress on the Internet infrastructure and seriously limits the performance gain from various search or routing techniques. We propose the Adaptive Overlay Topology Optimization (AOTO) technique, an algorithm of building an overlay multicast tree among each source node and its direct logical neighbors so as to alleviate the mismatching problem by choosing closer nodes as logical neighbors, while providing a larger query coverage range. AOTO is scalable and completely distributed in the sense that it does not require global knowledge of the whole overlay network when each node is optimizing the organization of its logical neighbors. The simulation shows that AOTO can effectively solve the mismatching problem and reduce more than 55% of the traffic generated by the P2P system itself.
We use metadata of various kinds to improve and enrich text document clustering, using an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented and evaluated, and the software is available on GitHub.
These are the slides of an invited talk I gave September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/
With the rise of Web 2.0, API-based software has appeared. This article examines the API-based search tool created for the Korean search engine Naver: WeboNaver (Webometrics Tool for Naver). The software is able to collect large amounts of data automatically and can easily distinguish between different types of information on the web, which was impossible before. In particular, Internet researchers can improve the efficiency of data analysis within a specified timeframe using this tool. This paper illustrates how to use WeboNaver and verifies its usability and reliability through several case studies. In this article, the web presence of Korean National Assembly Members was analyzed, as was the web presence of the term H1N1.
With the advent of Web 2.0 and the emergence of software programs built on Open APIs, users no longer need to manually search the web and sift through information one item at a time. Using public APIs, vast amounts of data can be systematically collected and managed with a few simple operations. This paper introduces WeboNaver (Webometrics Tool for Naver), a specialized search program developed using Open APIs. It automatically collects and stores large volumes of data by category from Naver, one of the most influential search engines in Korea. Researchers can use it to bring accuracy and a high degree of efficiency to data management, processing, and analysis. The purpose of this paper is to demonstrate the tool's usefulness by presenting concrete analysis procedures through real case studies, to help students, general users, and researchers who wish to use WeboNaver. Using the program, the web visibility of 292 members of the 18th National Assembly was examined, as was the web visibility of terms related to H1N1 (swine flu).
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU... (Denis Parra Santander)
- The first version was a guest lecture about Network Visualization in the class "Data Visualization" taught by Dr. Sharon Hsiao in the QMSS program at Columbia University: http://www.columbia.edu/~ih2240/dataviz/index.htm
- This updated version was delivered in our class on SNA at PUC Chile in the MPGI master program.
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q... (Databricks)
In real-life applications, we often deal with situations where analysis needs to be conducted on graphs whose nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate their membership in communities (e.g. group, school, company, etc.), and the labels associated with edges may denote types of activities (e.g. comment, like, share, etc.). The current GraphX library in Spark does not directly support efficient label-defined subgraph analysis and computations.
In this session, the speakers will propose a general API library that is able to support analysis on multi-label graphs, and can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing the module/algorithm in GraphX, such as Pregel API.
See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.
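A hedged PySpark sketch of the kind of multi-label subgraph workflow described above, using the open-source GraphFrames package; the column names and filtering approach are illustrative, not the speakers' actual library, and a `spark` session is assumed to exist.

```python
from graphframes import GraphFrame

# `spark` is an existing SparkSession.
# Vertices carry a community membership label; edges carry an activity label.
vertices = spark.createDataFrame(
    [("u1", "school"), ("u2", "company"), ("u3", "school")],
    ["id", "community"])
edges = spark.createDataFrame(
    [("u1", "u2", "like"), ("u2", "u3", "share"), ("u3", "u1", "like")],
    ["src", "dst", "action"])
g = GraphFrame(vertices, edges)

# Restrict to a label-defined subgraph, then reuse a stock algorithm on it.
likes = g.filterEdges("action = 'like'").dropIsolatedVertices()
ranks = likes.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```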
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways:
- Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
- We create a fingerprint for each content item by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed.
- Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarities into final clusters.
Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (Mikel Emaldi Manrique)
We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization, or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on the combination of Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. The results of this work can be applied to ease the task of data interlinking and to promote data reuse in the Semantic Web.
Full paper at: http://memaldi.github.io/pdf/iesd2015.pdf
Hierarchical clustering in Python and beyond (Frank Kelly)
Clustering of data is an increasingly important task for many data scientists. This talk will explore the challenge of hierarchical clustering of text data for summarisation purposes. We'll take a look at some great solutions now available to Python users, including the relevant scikit-learn libraries and Elasticsearch (with the carrot2 plugin), and check out visualisations from both approaches.
https://www.youtube.com/watch?v=KFs9pBAetOo
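As a small taste of the talk's subject, here is a minimal sketch of hierarchical clustering of text in Python with scikit-learn and SciPy, on toy data.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr", "kittens purr softly", "dogs bark", "puppies bark loudly"]
X = TfidfVectorizer().fit_transform(docs).toarray()

Z = linkage(X, method="ward")                    # agglomerative clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(list(zip(docs, labels)))

dendrogram(Z, labels=docs)                       # visualise the hierarchy
plt.show()
```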
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... (Spark Summit)
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
The Histogrammar package—a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics, and plotting in Scala—is introduced to enable interactive data analysis in the Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
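For flavor, a hedged Spark ML sketch of the all-pairs text similarity step such a pipeline might use: tokenize, hash to sparse feature vectors, and apply MinHash LSH so the bill-to-bill comparison stays tractable. The `bills` DataFrame, column names, and threshold are illustrative assumptions, not the authors' exact pipeline.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH, RegexTokenizer

# `bills` is assumed to be a DataFrame of legislative texts with a `text` column.
pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])
model = pipeline.fit(bills)
hashed = model.transform(bills)

# Approximate all-pairs join: candidate bill pairs within a Jaccard distance.
pairs = model.stages[-1].approxSimilarityJoin(hashed, hashed, 0.6, distCol="dist")
pairs.show()
```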
PEARC17: A real-time machine learning and visualization framework for scientif... (Feng Li)
High-performance computing resources are currently widely used in science and engineering areas. Typical post-hoc approaches use persistent storage to save data produced by simulations, so reading from storage into memory is required for data analysis tasks. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework to connect scientific simulations with multi-step machine learning processes and in-situ visualization tools, thus providing plugged-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulence flows.
Keynote given at the workshop for Artificial Intelligence meets the Web of Data on Pragmatic Semantics.
In this keynote I argue that the Web of Data is a Complex System or Marketplace of Ideas rather than a classical Database, and that the model theory on which classical semantics is based is not appropriate in all situations, and I propose an alternative "Pragmatic Semantics" based on optimisation of possible interpretations.
Transforming AI with Graphs: Real World Examples using Spark and Neo4j (Databricks)
Graphs – or information about the relationships, connections, and topology of data points – are transforming machine learning. We'll walk through real world examples of how to transform your tabular data into a graph and how to get started with graph AI. This talk will provide an overview of how to incorporate graph based features into traditional machine learning pipelines, create graph embeddings to better describe your graph topology, and give you a preview of approaches for graph native learning using graph neural networks. We'll talk about relevant, real world case studies in financial crime detection, recommendations, and drug discovery. This talk is intended to introduce the concept of graph based AI to beginners, as well as help practitioners understand new techniques and applications. Key takeaways: how graph data can improve machine learning, when graphs are relevant to data science applications, what graph native learning is, and how to get started.
Supporting Concept Search using a Clinical Healthcare Knowledge Graph (Sujit Pal)
We describe our Dictionary based Named Entity Recognizer and Semantic Matcher, which enable us to leverage our Knowledge Graph to provide Concept Search. We also describe our Named Entity Linking based Concept Recommender to support manual curation of our Knowledge Graph.
Youtube URL for talk: https://youtu.be/5UWrS_j8dDg
Google AI Hackathon: LLM based Evaluator for RAG (Sujit Pal)
Slides accompanying the project submission video for the Google AI Hackathon. Describes an LCEL and DSPy based evaluation framework inspired by the RAGAS project.
Accompanying video URL: https://youtu.be/yOIU65chc98
Building Learning to Rank (LTR) search reranking models using Large Language ... (Sujit Pal)
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close – however, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
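An illustrative sketch of the LLM labeling step, assuming the OpenAI chat completions API; the model name, prompt, and 0-3 label scale are placeholders rather than the setup used in the work described above.

```python
from openai import OpenAI

client = OpenAI()

def judge(query: str, document: str) -> int:
    prompt = (
        "On a scale of 0 (not relevant) to 3 (perfectly relevant), rate how "
        f"relevant this document is to the query.\nQuery: {query}\n"
        f"Document: {document}\nAnswer with a single digit."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the work above used 70B+ class models
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse the graded relevance label out of the completion.
    return int(resp.choices[0].message.content.strip()[0])
```

Labels produced this way can feed a pointwise LTR model directly, or be converted into preference pairs or list orderings for pairwise and listwise models.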
The ability to handle long question style queries is often de rigueur for modern search engines. Search giants such as Bing and Google are addressing this by building Large Language Models (LLMs) into their search pipelines. Unfortunately, this approach requires large investments in infrastructure and involves high operational costs. It can also lead to loss of confidence when the LLM hallucinates non-factual answers.
A best practice for designing search pipelines is to make the search layer as cheap and fast as possible, and move heavyweight operations into the indexing layer. With that in mind, we present an approach that combines the use of LLMs during indexing to generate questions from passages, and matching them to incoming questions during search, using either text based or vector based matching. We believe this approach can provide good quality question answering capabilities for search applications and address the cost and confidence issues mentioned above.
Vector search goes far beyond just text, and, in this interactive workshop, you will learn how to use it for multimodal search through an in-depth look at CLIP, a vision and language model developed by OpenAI. Sujit Pal, technology research director at Elsevier, and Raphael Pisoni, senior computer vision engineer at Partium.io, will walk you through two applications of image search and then have a panel discussion with our staff developer advocate, James, on how to use CLIP for image and text search.
Learning a Joint Embedding Representation for Image Search using Self-supervi... (Sujit Pal)
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
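A hedged sketch of joint-space scoring with the Hugging Face CLIP model mentioned above; the fine-tuning loop, medical dataset, and Vespa indexing are omitted, and the image path and captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("figure.png")  # placeholder image path
captions = ["chest x-ray", "histology slide", "brain mri"]

# Encode image and captions into the joint embedding space and score them.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
print(out.logits_per_image.softmax(dim=-1))  # similarity over the captions
```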
The power of community: training a Transformer Language Model on a shoestring (Sujit Pal)
I recently participated in a community event to train an ALBERT language model for the Bengali language. The event was organized by Neuropark, Hugging Face, and Yandex Research. The training was done collaboratively in a distributed manner using free GPU resources provided by Colab and Kaggle. Volunteers were recruited on Twitter and project coordination happened on Discord. At its peak, there were approximately 50 volunteers from all over the world simultaneously engaged in training the model. The distributed training was done on the Hivemind platform from Yandex Research, and the software to train the model in a data-parallel manner was developed by Hugging Face. In this talk I provide my perspective of the project as a somewhat curious participant. I will describe the Hivemind platform, the training regimen, and the evaluation of the language model on downstream tasks. I will also cover some challenges we encountered that were peculiar to the Bengali language (and Indic languages in general).
Accelerating NLP with Dask and Saturn Cloud (Sujit Pal)
Slides for talk delivered at NY NLP Meetup. Abstract -- Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines. This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. The pipeline was built and executed on Saturn Cloud, a platform that makes it easy to launch and manage Dask clusters. The talk will present an introduction to Dask and explain how users can easily accelerate Python and NLP code across clusters of machines.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 (Sujit Pal)
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
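A minimal sketch of the parallelization pattern described in the talks, assuming a SciSpaCy model is installed; the model name and file paths are placeholders, and the Saturn Cloud cluster setup is omitted.

```python
import dask.bag as db
import spacy

def extract_entities(texts):
    # Load the model once per partition, not once per document.
    nlp = spacy.load("en_core_sci_sm")  # SciSpaCy model, assumed installed
    records = []
    for doc in nlp.pipe(texts):
        records.extend({"entity": e.text, "label": e.label_} for e in doc.ents)
    return records

bag = db.read_text("cord19/*.txt")               # one text per line (placeholder)
entities = bag.map_partitions(extract_entities)  # runs in parallel on the cluster
entities.to_dataframe().to_parquet("entities.parquet")  # structured output
```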
Leslie Smith's Papers discussion for DL Journal Club (Sujit Pal)
This deck discusses two papers by Dr Leslie Smith. The first paper discusses empirical findings around learning rate (LR) and other regularization parameters for neural networks, and leads to the idea of Cyclic Learning Rates (CLR). The second paper discusses CLR in depth, as well as how to estimate its parameters. The slides also cover LR Finder, a tool first introduced in the Fast.AI library to find optimal parameters for CLR, including how to run it and interpret its outputs.
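A short example of the cyclical learning rate policy using PyTorch's built-in scheduler; the model is a stand-in and the bounds are placeholders that would normally come from an LR Finder (range test) run.

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,       # lower bound, typically from the LR range test
    max_lr=1e-2,        # upper bound, typically from the LR range test
    step_size_up=2000,  # iterations per half-cycle
)

for step in range(10):  # stand-in for the real training loop
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()    # advance the cyclic schedule once per batch
```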
Using Graph and Transformer Embeddings for Vector Based Retrieval (Sujit Pal)
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier’s broad coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, that encodes papers using citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, that are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
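A hedged sketch of the citation-graph embedding step using the open-source `node2vec` package over networkx; a toy graph stands in for the Scopus citation data, and the hyperparameters are illustrative.

```python
import networkx as nx
from node2vec import Node2Vec

# Toy citation graph: nodes are papers, edges are citation relationships.
g = nx.Graph([("paper_a", "paper_b"), ("paper_b", "paper_c"),
              ("paper_a", "paper_c")])

n2v = Node2Vec(g, dimensions=64, walk_length=10, num_walks=50, workers=2)
model = n2v.fit(window=5, min_count=1)  # trains word2vec over the random walks

# Nearest neighbours in the embedding space approximate "similar papers".
print(model.wv.most_similar("paper_a"))
```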
Transformer Mods for Document Length Inputs (Sujit Pal)
The Transformer architecture is responsible for many state of the art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost: an O(n²) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n²) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document level operations using one of these approaches.
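One such approach available in the Hugging Face transformers library is Longformer, whose sparse sliding-window attention reduces the quadratic cost and accepts inputs up to 4096 tokens; a brief, self-contained example follows.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_document = " ".join(["word"] * 3000)  # placeholder document-length input
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```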
Question Answering as Search - the Anserini Pipeline and Other Stories (Sujit Pal)
In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find an answer to a question from a single document or paragraph. A potentially more useful task is to find an answer for a question from a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering.
To do this, we adapted the BERTSerini architecture (Yang, et al., 2019), using it to answer questions about clinical content from our corpus of 5000+ medical textbooks. The BERTSerini pipeline consists of two components -- a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, which the BERT model uses to retrieve the answer from. The best answer is determined using a combination of passage retrieval and answer scores.
We evaluated this system using a locally developed dataset of medical passages, questions, and answers, adapting the BERT Question Answering component to our content through a combination of fine-tuning with third party SQuAD data and pre-training the model on our medical content. However, when we replaced the canned passages with passages retrieved using the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor.
The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
Building Named Entity Recognition Models Efficiently using NERDS (Sujit Pal)
Named Entity Recognition (NER) is foundational for many downstream NLP tasks such as Information Retrieval, Relation Extraction, Question Answering, and Knowledge Base Construction. While many high-quality pre-trained NER models exist, they usually cover a small subset of popular entities such as people, organizations, and locations. But what if we need to recognize domain specific entities such as proteins, chemical names, diseases, etc? The Open Source Named Entity Recognition for Data Scientists (NERDS) toolkit, from the Elsevier Data Science team, was built to address this need.
NERDS aims to speed up development and evaluation of NER models by providing a set of NER algorithms that are callable through the familiar scikit-learn style API. The uniform interface allows reuse of code for data ingestion and evaluation, resulting in cleaner and more maintainable NER pipelines. In addition, customizing NERDS by adding new and more advanced NER models is also very easy, just a matter of implementing a standard NER Model class.
Our presentation will describe the main features of NERDS, then walk through a demonstration of developing and evaluating NER models that recognize biomedical entities. We will then describe a Neural Network based NER algorithm (a Bi-LSTM seq2seq model written in PyTorch) and integrate it into the NERDS NER pipeline.
We believe NERDS addresses a real need for building domain specific NER models quickly and efficiently. NER is an active field of research, and the hope is that this presentation will spark interest and contributions of new NER algorithms and Data Adapters from the community that can in turn help to move the field forward.
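As a purely hypothetical sketch of the fit/predict contract the abstract describes (the class and method names below are illustrative, not the actual NERDS API), the value of the uniform interface is that data ingestion and evaluation code can be shared across models.

```python
from sklearn.metrics import classification_report

class GazetteerNER:
    """Toy dictionary tagger implementing the shared fit/predict contract."""

    def fit(self, token_seqs, label_seqs):
        # Remember every labelled (non-"O") token seen during training.
        self.lexicon = {t: l for toks, labs in zip(token_seqs, label_seqs)
                        for t, l in zip(toks, labs) if l != "O"}
        return self

    def predict(self, token_seqs):
        return [[self.lexicon.get(t, "O") for t in toks] for toks in token_seqs]

train_X = [["aspirin", "treats", "headache"]]
train_y = [["B-CHEM", "O", "B-DISEASE"]]
model = GazetteerNER().fit(train_X, train_y)
pred = model.predict([["aspirin", "prevents", "headache"]])
# The same flatten-and-score evaluation works for any model with this API.
print(classification_report(sum(train_y, []), sum(pred, [])))
```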
Learning to Rank Presentation (v2) at LexisNexis Search Guild (Sujit Pal)
An introduction to Learning to Rank, with case studies using RankLib with and without plugins provided by Solr and Elasticsearch. RankLib is a library of learning to rank algorithms, which includes some popular LTR algorithms such as LambdaMART, RankBoost, RankNet, etc.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about history of LTR, taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR using the TMDB dataset using Solr, Elasticsearch and without index support.
Search summit-2018-content-engineering-slides (Sujit Pal)
Slides accompanying content engineering tutorial presented at RELX Search Summit 2018. Contains techniques for keyword extraction using various statistical, rule based and machine learning methods, keyword de-duplication using SimHash and Dedupe, and dimensionality reduction techniques such as Topic Modeling, NMF, Word vectors, etc.
SoDA v2 - Named Entity Recognition from streaming text (Sujit Pal)
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Evolving a Medical Image Similarity Search (Sujit Pal)
Slides for talk at Haystack Conference 2018. Covers evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. Discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also ran a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, the aspects they look for in a new TV, and their TV buying preferences.
2. #Graphorum
Who am I?
• (Mostly self-taught) data scientist
• Work at Elsevier Labs
• Worked with Deep Learning, Machine Learning, Natural Language Processing, Search, Backend Web Development, Database Administration, and Unix System Administration in reverse chronological order.
• Took Graph Theory in college
• Rekindled interest after Social Network Analysis course on Coursera
• Interested in applications of Graph techniques to NLP
4. #Graphorum
Typical NLP + Graph problems
• Represent text units as nodes and (similarity based) relationships as edges in graph
• Leverage intrinsic or extrinsic graphical structure of data
• Intrinsic – co-citations and co-mentions in academic graph
• Extrinsic – text data from social networks
• Leverage external graph structure such as Knowledge Graph to improve results for NLP task
5. #Graphorum
Case Studies
• Summarization using network metrics
• Document Clustering using Random Walk
• Word Sense Disambiguation using Label Propagation
• Incorporating external knowledge for Topic Identification
6. #Graphorum
Matrices and Graphs are Interchangeable
• Text elements => vectors
• Collection of elements => matrix
• Similarity = operation on pairwise rows of matrix
• Convert to graph
• Graph Methods!
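To make the matrix-to-graph round trip concrete, here is a minimal Python sketch (illustrative, not from the talk; the toy corpus and the 0.1 edge threshold are assumptions) using scikit-learn and networkx:

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["graphs model relationships between vertices",
        "languages model relationships between words",
        "cats purr when they are happy"]
X = TfidfVectorizer().fit_transform(docs)   # collection of elements => matrix
S = cosine_similarity(X)                    # pairwise row similarity

G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if S[i, j] > 0.1:                   # keep only meaningful edges
            G.add_edge(i, j, weight=S[i, j])
# From here, graph methods (centrality, communities, ...) apply.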
8. #Graphorum
Case Study #1: Steps
• Create graph – sentences are nodes, edges connect sentences that share common meaningful nouns
• Develop 14 summarizers (CN-SUMM) based on various graph metrics, each summarizer produces a ranked list of sentences
• Voting based ensemble (CN-VOTING) ranks sentences with sum of rankings from each of the 14 summarizers
• Return top ranked sentences from CN-VOTING as summary
9. #Graphorum
Case Study #1: Implementation
• Extract common nouns from sentences and compute similarity as overlap
• Construct graph of sentences
• Compute Degree, Strength, Closeness, and PageRank centrality scores per node, Shortest Path from each node to every other node, D-Ring, k-Core, and w-Cuts, determine K most central nodes by each measure
• Ensemble predictions using Voting to produce summary sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/01-doc-summarization
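As a rough illustration of the voting ensemble (a sketch under assumptions, not the repo's code; only three of the 14 summarizers are shown, and the sentence graph G and summary length K are assumed to come from the steps above):

import networkx as nx

def ranked(scores):
    # sentence ids, most central first
    return sorted(scores, key=scores.get, reverse=True)

summarizers = [
    nx.degree_centrality(G),
    nx.closeness_centrality(G),
    nx.pagerank(G),
]

# CN-VOTING: sum each sentence's rank position across summarizers;
# a lower total means the sentence is more central overall.
votes = {n: 0 for n in G.nodes()}
for scores in summarizers:
    for rank, node in enumerate(ranked(scores)):
        votes[node] += rank

K = 5  # assumed summary length
summary_sentence_ids = sorted(votes, key=votes.get)[:K]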
10. #Graphorum
Case Study #1: Degree and Strength
• Degree – number of edges incident on a vertex, measured by Degree Centrality
• Strength – sum of edge weights incident on the vertex, measured by Weighted Degree Centrality
11. #Graphorum
Case Study #1: Closeness
• Closeness Centrality measures how efficiently a vertex is able to spread information across the network
• Defined as the inverse of the average distance (“farness”) to all other nodes
12. #Graphorum
Case Study #1: PageRank
• Popularized by Google’s Brin and Page
• Quality and number of in-links to a page is rough estimate of page quality
• Iterative procedure, until convergence
• Starts with all nodes having same rank
• “Surfer” starts on random page
• Chooses a page randomly from among its outlinks
• With probability d (d=0.15 for web) jumps to some random page on web
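A toy power-iteration sketch of this procedure (illustrative; follows the slide's convention where d = 0.15 is the probability of jumping to a random page):

import numpy as np

A = np.array([[0, 1, 1],   # adjacency matrix: row i links to column j
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
d, n = 0.15, A.shape[0]

r = np.full(n, 1.0 / n)                # all nodes start with the same rank
for _ in range(100):                   # iterate until convergence
    r_next = d / n + (1 - d) * (r @ P)
    if np.abs(r_next - r).sum() < 1e-9:
        break
    r = r_next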
13. #Graphorum
Case Study #1: Shortest Paths
• Mean shortest path from each node to every other node
• Compute all-pairs shortest paths
• Algorithm uses linear number of matrix multiplications
• Order is O(V⁴)
• Introduced by Shimbel (1953)
• Compute mean shortest path from each node to all other nodes
• An indirect measure of centrality
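For modest graph sizes this centrality measure is a few lines with networkx (a sketch; networkx computes all-pairs shortest paths via BFS rather than the matrix-multiplication formulation above, and the karate club graph is a stand-in):

import networkx as nx

G = nx.karate_club_graph()   # stand-in graph
mean_dist = {}
for src, dists in nx.all_pairs_shortest_path_length(G):
    mean_dist[src] = sum(dists.values()) / (len(G) - 1)
# A lower mean distance to all other nodes indicates a more central node.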
14. #Graphorum
Case Study #1: D-Ring
• Create subgraphs by dilating
• Start with highest degree (or position)
• D-ring is difference of subgraphs created by consecutive dilations
• Continue to add D-rings until enough nodes available for summary
• Two measures of centrality – CN-RingL and CN-RingK
15. #Graphorum
Case Study #1: K-Core
• Create subgraph starting with node with highest degree (or position) k
• Relax threshold for k and continue adding nodes until there are enough nodes for the summary
• Two measures of centrality based on degree or position – CN-CoreK and CN-CoreL
16. #Graphorum
Case Study #1: W-Cuts
• Create subgraph starting with node pair with highest edge weight
• Relax edge weight threshold and continue adding nodes until enough nodes available for summary
• Two measures of centrality – CN-CutK and CN-CutL, based on preference given to position or degree
18. #Graphorum
Case Study #1: Closing Thoughts
• Generated summaries are good, but biased towards longer sentences
• Strategy described above can be extended to multi-document summarization as well, e.g., summary of product reviews.
• A variant of the strategy described is used in the gensim summarizer.
21. #Graphorum
Case Study #2: Steps
• Paper asserts that Language Model based graph is more effective for clustering than TD (term-document) matrix based graph
• Represent each document in corpus as a node, edges connect documents by cosine similarity of TF-IDF document vectors
• Compute t-step (t=1, 2, 3) random walks for each node, considering only top k edges (for k ~ 80), and compute generation probabilities
• Cluster resulting graph of document generation probabilities with k-means and Louvain Modularity
22. #Graphorum
Case Study #2: Implementation
• 20-Newsgroups dataset (18k newsgroup postings, 20 categories)
• Clean text and construct TD matrix
• Construct cosine similarity matrix S, sparsify using top generators (c=80), remove self-edges, and renormalize
• Run (c=80) random walks on each node for path length = 1, 2, 3
• Compute empirical transition probability matrix (language model!) G from walks
• Construct graph, apply Louvain Community Detection on various graphs
• Compare against K-Means clusters from various document vectors
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/02-docs-clustering
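The random-walk and community-detection steps might look like the following sketch (illustrative, not the repo's code; it assumes G is the sparsified cosine-similarity graph from above, and uses networkx's Louvain implementation, available in networkx 3.0+, in place of Neo4j's):

import random
from collections import Counter
import networkx as nx

def walk_counts(G, walks_per_node=80, length=3):
    # Empirical visit counts from short random walks -- the "language
    # model" over documents.
    counts = {n: Counter() for n in G.nodes()}
    for start in G.nodes():
        for _ in range(walks_per_node):
            node = start
            for _ in range(length):
                nbrs = list(G.neighbors(node))
                if not nbrs:
                    break
                node = random.choice(nbrs)
                counts[start][node] += 1
    return counts

# Edge weights in the new graph are empirical generation probabilities.
H = nx.Graph()
for src, ctr in walk_counts(G).items():
    total = sum(ctr.values())
    for dst, c in ctr.items():
        w = c / total
        if H.has_edge(src, dst):
            w = (w + H[src][dst]["weight"]) / 2   # symmetrize the two directions
        H.add_edge(src, dst, weight=w)

clusters = nx.community.louvain_communities(H, weight="weight")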
23. #Graphorum
Case Study #2: TD Matrix to Cosine Similarity
Image Credit: https://www.quora.com/What-is-a-tf-idf-vector
• Documents represented as TD Matrix (n documents x t features)
• Similarity Matrix (n x n) = TD Matrix (n x t) times its transpose (t x n), divided by |T| to keep similarity values in range (0, 1)
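In numpy terms (a minimal sketch of the same computation; here the row normalization plays the role of the |T| division, making the matrix product exactly a cosine similarity):

import numpy as np

TD = np.random.rand(5, 10)                          # n docs x t features
T = TD / np.linalg.norm(TD, axis=1, keepdims=True)  # normalize rows
S = T @ T.T                                         # n x n similarities in (0, 1)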
24. #Graphorum
Case Study #2: Random Walks
Image Credit: https://snap.stanford.edu/node2vec/
• Probabilistic technique used to “flatten” graph into feature vector
• Intuition – similar nodes are closer to each other in the graph than dissimilar nodes
• Compute empirical generation probabilities
• Other popular applications – DeepWalk and node2vec
25. #Graphorum
Case Study #2: Louvain Modularity
• Community Detection Algorithm – maximize modularity score for each community
• Modularity = difference between actual and expected number of edges between each node pair, summed over all node pairs in the community
• Iterative procedure, run till convergence
• Greedily assign nodes to communities, optimizing local modularity
• Define a coarse grained network of communities
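For reference, the modularity score being maximized is conventionally written as

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where m is the number of edges, A_{ij} the adjacency matrix, k_i the degree of node i, and \delta(c_i, c_j) = 1 exactly when nodes i and j are in the same community.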
26. #Graphorum
Case Study #2: Results
• Silhouette Score = tightness / separation
• Baseline – TD + K-means + Labels close to 0
• G1, G2, G3 – LM Matrices for n=1, n={1,2}, and n={1,2,3}
• LM based graphs outperform TD matrix based graphs
• Louvain outperforms K-Means
27. #Graphorum
Case Study #2: Closing Thoughts
• Transforming the graph to have edges based on transition probabilities from random walks yields better clustering results
• Random Walks on graph structures are often used to “flatten” the graph and expose higher-order proximity dependencies that can sometimes look like semantic similarity
• Community Detection algorithms can be used for clustering, and often produce more explainable clusters
30. #Graphorum
Case Study #3: Steps
• Choose ambiguous word of interest (https://muse.dillfrog.com/lists/ambiguous)
• Find sentences containing ambiguous word from large corpus
• Manually assign labels to some sentences
• Featurize each sentence using POS of neighboring words, unigrams, and local collocations
• Create graph with sentences as nodes, edges weighted by cosine similarity and JS divergence of feature vectors
• Propagate Labels till convergence
• Generate word sense clusters
31. #Graphorum
Case Study #3: Implementation
• We selected the ambiguous word “compound” with these 2 senses
• Chemical compound
• Composite or multiple
• Extracted 670 sentences containing “compound” from SD corpus
• Manually marked up 40 total sentences (19 + 21), ~ 5% of corpus
• Created TD matrix of 1..3 grams + 3-gram POS tags, sparsified (k=5), removed self-edges, and created graph
• Ran Label Propagation to propagate the 40 labels to unlabeled sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/03-word-sense-disambiguation
32. #Graphorum
Case Study #3: Label Propagation
• Label Propagation uses network structure to detect communities
• Used here in semi-supervised manner by specifying labels for a small subset of nodes
• Iterative algorithm
• Initialize nodes each with unique label
• Each node updates its label to the most frequent label of its neighbors
• Converges when each node has the most frequent label of its neighbors
• Not guaranteed to converge!
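An illustrative semi-supervised sketch with scikit-learn's LabelPropagation over sentence feature vectors (an assumption-laden stand-in for the Neo4j pipeline used in the talk; the random features are placeholders, and the label counts simply mirror the numbers on the previous slide):

import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(42)
X = rng.random((670, 50))      # stand-in sentence feature vectors
y = np.full(670, -1)           # -1 marks unlabeled sentences
y[:19], y[19:40] = 0, 1        # the 40 manually labeled seeds

model = LabelPropagation(kernel="knn", n_neighbors=5)
model.fit(X, y)
senses = model.transduction_   # predicted sense for every sentence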
33. #Graphorum
Case Study #3: Results
• Of 623 unlabeled sentences, Label Propagation predicts 319 sentences use the first sense (chemical compound), 7 use the second sense (composite), and misses 298
• Misses are mostly chemical compounds (sense 1)
• Examples:
• Sense #1: ORTEP view of the compound [CuL8(ClO4)2] with the numbering scheme adopted.
• Sense #2: Sensitive to compound fluorescence.
• Results can probably be improved – we tried increasing the number of initial labels, and starting with denser networks (so LP does not terminate as quickly)
36. #Graphorum
Case Study #4: Steps
• Build up in-memory graph structure for Knowledge Graph (KG)
• Match phrases in document against KG entries
• Compute Personalized PageRank (PPR) biased to matched nodes
• Roll up top scored concepts from PPR to category concepts
• Report top category concepts as document topics
37. #Graphorum
Case Study #4: Implementation
• Annotate ScienceDaily article against Aho-Corasick dictionary of KG concepts
• Using company proprietary KG to build graph, 2 versions
• Lateral relations only
• isChildOf (child -> parent) relations only
• Run Personalized PageRank (PPR) against Lateral Relations graph, setting source nodes to concepts found in article
• Roll up high PPR score concepts to disease category concepts
• Top disease category concepts are document topic labels
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/04-topic-identification
38. #Graphorum
Case Study #4: Aho-Corasick Matching
• Inverted index of terms to concept ID stored in trie-like data structure, where every node is a token in phrase representing concept name
• Document streamed against this data structure to produce list of phrases in document matched against concepts in dictionary
Image Credit: https://brunorb.com/aho-corasick/
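A dictionary-matching sketch with the pyahocorasick library (illustrative; the concept names and ids are made up, and the talk's actual pipeline may differ):

import ahocorasick

concepts = {"heart attack": "C001", "aspirin": "C002"}  # phrase -> concept id
A = ahocorasick.Automaton()
for phrase, concept_id in concepts.items():
    A.add_word(phrase, (concept_id, phrase))
A.make_automaton()

doc = "aspirin is often given after a heart attack"
matches = [value for _end, value in A.iter(doc)]
# -> [('C002', 'aspirin'), ('C001', 'heart attack')]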
39. #Graphorum
Case Study #4: Personalized PageRank
• In PageRank, surfer doing random walk on graph jumps to some random point in the graph with some probability d (d=0.15 for web graphs)
• In Personalized PageRank (PPR), surfer will jump to a neighborhood of the graph specified by a set of nodes (source nodes)
• Overall effect is to assign high PPR to nodes that are in close proximity to the source nodes
• Personalized PageRank has been found to be an effective measure for recommendation systems
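With networkx (a stand-in for the Neo4j implementation the talk used), Personalized PageRank is the regular pagerank call with a personalization vector concentrated on the source nodes; here G is assumed to be the lateral-relations KG graph and `matched` the concept nodes found by the dictionary-matching step:

import networkx as nx

personalization = {n: (1.0 if n in matched else 0.0) for n in G.nodes()}
ppr = nx.pagerank(G, alpha=0.85, personalization=personalization)
top_concepts = sorted(ppr, key=ppr.get, reverse=True)[:20]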
40. #Graphorum
Case Study #4: Disease Categories
• Disease Category Concepts are children of Diseases Concept
• Navigate to parent from Discovered Concepts until a Disease Category node is found (or no parents are found)
• Roll up discovered concepts to their Disease Categories – these are the Document Topics
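The roll-up step is a walk up the isChildOf edges (a sketch; K is assumed to be a directed networkx graph with child -> parent edges, disease_categories a set of category node ids, and top_concepts the output of the PPR step above):

def roll_up(K, concept, disease_categories):
    # Follow child -> parent edges until a Disease Category node is
    # found, or there are no more parents.
    seen, frontier = set(), [concept]
    while frontier:
        node = frontier.pop()
        if node in disease_categories:
            return node
        if node not in seen:
            seen.add(node)
            frontier.extend(K.successors(node))
    return None

topics = {roll_up(K, c, disease_categories) for c in top_concepts}
topics.discard(None)   # concepts with no Disease Category ancestor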
42. #Graphorum
Case Study #4: Closing Thoughts
• Topic predictions from rolling up high PPR concepts are serendipitous, but not necessarily complete
• Better results if combined with topic predictions obtained from rolling up concepts found in article
44. #Graphorum
Summing up
• Content features and graph structure often reinforce each other
• Can be useful for unsupervised and semi-supervised NLP tasks
• Not necessarily an either-or – BERT based models can coexist with Graph techniques
45. #Graphorum
Tools
• Originally planned to use Spark + GraphFrames for large graphs and Neo4j for small / medium graphs
• Neo4j worked well for largest graph (500 K nodes, 1.3 M edges)
• Neo4j algorithms frequently have more functionality
• Allows multiple source nodes for Personalized PageRank
• Allows weighted edges in Label Propagation
• Ended up using Neo4j for all case studies