#Graphorum
Graph Techniques for Natural Language Processing
Sujit Pal, Elsevier Labs
#Graphorum
Who am I?
• (Mostly self-taught) data scientist
• Work at Elsevier Labs
• Worked with Deep Learning, Machine Learning, Natural Language Processing, Search, Backend Web Development, Database Administration, and Unix System Administration, in reverse chronological order
• Took Graph Theory in college
• Rekindled interest after the Social Network Analysis course on Coursera
• Interested in applications of graph techniques to NLP
NLP Today
Image Credit: https://www.kaggle.com/general/76963
Typical NLP + Graph problems
• Represent text units as nodes and (similarity-based) relationships as edges in a graph
• Leverage intrinsic or extrinsic graph structure of the data
• Intrinsic – co-citations and co-mentions in an academic graph
• Extrinsic – text data from social networks
• Leverage external graph structure, such as a Knowledge Graph, to improve results on an NLP task
Case Studies
• Summarization using network metrics
• Document Clustering using Random Walk
• Word Sense Disambiguation using Label Propagation
• Incorporating external knowledge for Topic Identification
Matrices and Graphs are Interchangeable
• Text elements => vectors
• Collection of elements => matrix
• Similarity = operation on pairwise rows of the matrix
• Convert to a graph
• Graph methods!
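The matrix-to-graph step above can be sketched in a few lines of Python. This is a minimal illustration (the function name and threshold are made up, not from the talk's Neo4j code): keep only the pairwise similarities above a threshold as weighted, undirected edges.

```python
# Hypothetical sketch: turn a pairwise similarity matrix into an
# adjacency-list graph by keeping edges above a threshold.
def similarity_to_graph(sim, threshold=0.5):
    """sim: square list-of-lists of pairwise similarities."""
    n = len(sim)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                graph[i][j] = sim[i][j]   # undirected, weighted edge
                graph[j][i] = sim[i][j]
    return graph

sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]
g = similarity_to_graph(sim)
```

Once the data is in this form, any graph algorithm (centrality, community detection, random walks) can be applied to what started out as a matrix of text features.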
Case Study #1
Full paper: https://www.sciencedirect.com/science/article/pii/S0020025508004520
Case Study #1: Steps
• Create graph – sentences are nodes, edges connect sentences that share common meaningful nouns
• Develop 14 summarizers (CN-SUMM) based on various graph metrics; each summarizer produces a ranked list of sentences
• A voting-based ensemble (CN-VOTING) ranks sentences by the sum of their rankings from each of the 14 summarizers
• Return the top-ranked sentences from CN-VOTING as the summary
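The CN-VOTING step can be sketched as a simple rank-sum vote. This is a simplified stand-in for the paper's ensemble, with hypothetical names: each summarizer returns a ranked list of sentence ids, and the ensemble scores each sentence by the sum of its ranks (lower total = more central).

```python
# Hypothetical sketch of the CN-VOTING idea: score each sentence by the
# sum of its ranks across summarizers, then return the top n.
def vote(rankings, n):
    scores = {}
    for ranking in rankings:
        for rank, sent_id in enumerate(ranking):
            scores[sent_id] = scores.get(sent_id, 0) + rank
    return sorted(scores, key=lambda s: scores[s])[:n]

rankings = [
    ["s1", "s2", "s3"],   # summarizer A
    ["s2", "s1", "s3"],   # summarizer B
    ["s1", "s3", "s2"],   # summarizer C
]
summary = vote(rankings, 2)
```
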
Case Study #1: Implementation
• Extract common nouns from sentences and compute similarity as overlap
• Construct a graph of sentences
• Compute Degree, Strength, Closeness, and PageRank centrality scores per node; Shortest Path from each node to every other node; D-Rings, k-Cores, and w-Cuts; determine the K most central nodes by each measure
• Ensemble predictions using voting to produce summary sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/01-doc-summarization
Case Study #1: Degree and Strength
• Degree – number of edges incident on a vertex, measured by Degree Centrality
• Strength – sum of edge weights incident on the vertex, measured by Weighted Degree Centrality
Case Study #1: Closeness
• Closeness Centrality measures how efficiently a vertex can spread information across the network
• Defined as the inverse of “farness”, i.e., the reciprocal of the average distance to all other nodes
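Closeness centrality can be sketched for an unweighted graph with a breadth-first search per node. This is a minimal pure-Python illustration (not the talk's Neo4j implementation): closeness of a node is (n−1) divided by the sum of its distances to all other reachable nodes.

```python
from collections import deque

# Sketch: closeness centrality on an unweighted adjacency-list graph,
# computed as (n - 1) / sum of BFS distances to all other nodes.
def closeness(graph, source):
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# path graph a-b-c: the middle node is closest to everything
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

For the path graph above, `closeness(g, "b")` is 1.0 (distance 1 to both neighbors) while the endpoints score lower, matching the intuition that central sentences can "reach" the rest of the document cheaply.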
Case Study #1: PageRank
• Popularized by Google’s Brin and Page
• The quality and number of in-links to a page is a rough estimate of page quality
• Iterative procedure, run until convergence
• Starts with all nodes having the same rank
• “Surfer” starts on a random page
• Chooses a page randomly from among its outlinks
• With probability d (d = 0.15 for the web) jumps to some random page on the web
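The iteration described above can be sketched as a power-iteration loop in pure Python. This is an illustrative implementation, not the talk's Neo4j one; `d` here is the teleport probability (0.15, matching the slide's convention).

```python
# Sketch of PageRank by power iteration on an adjacency-list digraph;
# d is the teleport probability (0.15 here, matching the slide).
def pagerank(graph, d=0.15, iters=100):
    n = len(graph)
    rank = {u: 1.0 / n for u in graph}          # start with equal ranks
    for _ in range(iters):
        new = {u: d / n for u in graph}          # teleport mass
        for u, outlinks in graph.items():
            if not outlinks:                     # dangling node: spread evenly
                for v in graph:
                    new[v] += (1 - d) * rank[u] / n
            else:
                share = (1 - d) * rank[u] / len(outlinks)
                for v in outlinks:
                    new[v] += share
        rank = new
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
pr = pagerank(g)
```

In the toy graph, "b" receives links from both "a" and "c", so it ends up with the highest rank; the ranks always sum to 1.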
Case Study #1: Shortest Paths
• Compute all-pairs shortest paths
• The algorithm, introduced by Shimbel (1953), uses a linear number of matrix multiplications, for O(V⁴) overall
• Compute the mean shortest path from each node to all other nodes
• An indirect measure of centrality
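The mean-shortest-path measure can be sketched with repeated BFS, which for unweighted graphs costs O(V·(V+E)) rather than the O(V⁴) matrix-multiplication scheme, while computing the same quantity. Names are illustrative:

```python
from collections import deque

# Sketch: mean shortest-path length per node via repeated BFS on an
# unweighted adjacency-list graph (lower mean = more central).
def bfs_dists(graph, source):
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def mean_shortest_path(graph):
    means = {}
    for u in graph:
        d = bfs_dists(graph, u)
        means[u] = sum(d.values()) / (len(d) - 1)   # exclude self
    return means

# path graph a-b-c-d: interior nodes have lower mean distance
g = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
m = mean_shortest_path(g)
```
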
Case Study #1: D-Ring
• Create subgraphs by dilating
• Start with the highest degree (or position)
• A D-ring is the difference of the subgraphs created by consecutive dilations
• Continue to add D-rings until enough nodes are available for the summary
• Two measures of centrality – CN-RingL and CN-RingK
Case Study #1: K-Core
• Create a subgraph starting with the node with the highest degree (or position) k
• Relax the threshold for k and continue adding nodes until there are enough nodes for the summary
• Two measures of centrality, based on degree or position – CN-CoreK and CN-CoreL
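The classic way to extract a k-core (the maximal subgraph where every node has degree ≥ k) is by repeatedly peeling low-degree nodes. A minimal sketch, not the talk's Neo4j code:

```python
# Sketch: extract the k-core of an undirected adjacency-list graph by
# repeatedly peeling nodes with degree < k.
def k_core(graph, k):
    adj = {u: set(vs) for u, vs in graph.items()}
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj[u]:
                    adj[v].discard(u)   # drop u from its neighbors
                del adj[u]
                changed = True
    return adj

# triangle a-b-c plus pendant node d attached to a
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
core2 = k_core(g, 2)
```

The pendant node "d" is peeled away, leaving the triangle as the 2-core; relaxing k admits progressively looser nodes, as the slide describes.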
Case Study #1: W-Cuts
• Create a subgraph starting with the node pair with the highest edge weight
• Relax the edge weight threshold and continue adding nodes until enough nodes are available for the summary
• Two measures of centrality – CN-CutK and CN-CutL, based on preference given to position or degree
Case Study #1: Results
Case Study #1: Closing Thoughts
• Generated summaries are good, but biased towards longer sentences
• The strategy described above can be extended to multi-document summarization as well, e.g., a summary of product reviews
• A variant of the strategy described (TextRank) is used in the gensim summarizer
End of Case Study #1
Case Study #2
Full paper: https://www.aclweb.org/anthology/N06-1061
Case Study #2: Steps
• The paper asserts that a Language Model (LM) based graph is more effective for clustering than a TD (term-document) matrix based graph
• Represent each document in the corpus as a node; edges connect documents by cosine similarity of their TF-IDF document vectors
• Compute t-step (t=1, 2, 3) random walks from each node, considering only the top k edges (k ~ 80), and compute generation probabilities
• Cluster the resulting graph of document generation probabilities with K-Means and Louvain Modularity
Case Study #2: Implementation
• 20-Newsgroups dataset (18k newsgroup postings, 20 categories)
• Clean text and construct the TD matrix
• Construct cosine similarity matrix S; sparsify using top generators (c=80), remove self-edges, and renormalize
• Run (c=80) random walks from each node for path lengths 1, 2, and 3
• Compute the empirical transition probability matrix (a language model!) G from the walks
• Construct a graph; apply Louvain Community Detection on the various graphs
• Compare against K-Means clusters from various document vectors
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/02-docs-clustering
Case Study #2: TD Matrix to Cosine Similarity
Image Credit: https://www.quora.com/What-is-a-tf-idf-vector
• Documents represented as a TD Matrix (n documents x t features)
• Similarity Matrix (n x n) = TD Matrix (n x t) times its transpose (t x n), with each row first divided by its L2 norm so similarity values stay in the range (0, 1)
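The matrix product described above can be sketched in pure Python (a real pipeline would use numpy or sklearn; names are illustrative). Normalizing each row to unit length first means the product of the matrix with its transpose yields cosine similarities directly:

```python
import math

# Sketch: cosine similarity matrix = row-normalized TD matrix times its
# transpose (pure Python stand-in for a numpy/sklearn pipeline).
def cosine_sim_matrix(td):
    norms = [math.sqrt(sum(x * x for x in row)) for row in td]
    normed = [[x / n if n else 0.0 for x in row] for row, n in zip(td, norms)]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in normed] for r1 in normed]

td = [
    [1, 1, 0],   # doc 0
    [1, 1, 0],   # doc 1 (identical term profile to doc 0)
    [0, 0, 1],   # doc 2 (shares no terms)
]
s = cosine_sim_matrix(td)
```
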
Case Study #2: Random Walks
Image Credit: https://snap.stanford.edu/node2vec/
• Probabilistic technique used to “flatten” a graph into feature vectors
• Intuition – similar nodes are closer to each other in the graph than dissimilar nodes
• Compute empirical generation probabilities
• Other popular applications – DeepWalk and node2vec
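The "empirical generation probabilities" idea can be sketched by sampling fixed-length walks from each node and counting where they end up. This is an illustrative sketch with made-up names, not the paper's exact estimator:

```python
import random
from collections import Counter

# Sketch: sample fixed-length random walks from each node and estimate
# generation probabilities from the walk endpoints.
def walk(graph, start, length, rng):
    node = start
    for _ in range(length):
        node = rng.choice(graph[node])
    return node

def generation_probs(graph, length=2, n_walks=1000, seed=42):
    rng = random.Random(seed)
    probs = {}
    for u in graph:
        ends = Counter(walk(graph, u, length, rng) for _ in range(n_walks))
        probs[u] = {v: c / n_walks for v, c in ends.items()}
    return probs

# triangle graph: a 2-step walk from "a" returns to "a" half the time
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
p = generation_probs(g)
```

Each node's endpoint distribution is the "flattened" feature vector the slide refers to; two nodes embedded in similar neighborhoods get similar distributions.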
Case Study #2: Louvain Modularity
• Community Detection algorithm – maximizes the modularity score of each community
• Modularity = difference between the actual number of edges between a node pair and the expected number of edges, summed over all node pairs in the community
• Iterative procedure, run until convergence
• Greedily assign nodes to communities, optimizing local modularity
• Define a coarse-grained network of communities
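The modularity score Louvain greedily maximizes can be computed directly. A pure-Python sketch (a full Louvain implementation is out of scope here; names are illustrative): for every intra-community node pair, accumulate the actual-minus-expected edge count.

```python
# Sketch: modularity Q of a partition -- the quantity Louvain maximizes.
# Q = (1/2m) * sum over same-community pairs of (A_ij - k_i * k_j / 2m)
def modularity(graph, communities):
    m = sum(len(vs) for vs in graph.values()) / 2      # number of edges
    degree = {u: len(vs) for u, vs in graph.items()}
    comm = {u: c for c, nodes in enumerate(communities) for u in nodes}
    q = 0.0
    for u in graph:
        for v in graph:
            if comm[u] == comm[v]:
                a = 1 if v in graph[u] else 0
                q += a - degree[u] * degree[v] / (2 * m)
    return q / (2 * m)

# two triangles joined by one edge: the obvious split scores well
g = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
good = modularity(g, [{1, 2, 3}, {4, 5, 6}])
bad = modularity(g, [{1, 2, 3, 4, 5, 6}])
```

Splitting the two triangles apart yields a clearly positive modularity, while lumping everything into one community scores zero, which is why greedy local moves can find the natural clusters.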
Case Study #2: Results
• Silhouette Score = tightness / separation
• Baseline (TD + K-Means + Labels) – scores close to 0
• G1, G2, G3 – LM matrices for n=1, n={1,2}, and n={1,2,3}
• LM based graphs outperform TD matrix based graphs
• Louvain outperforms K-Means
Case Study #2: Closing Thoughts
• Transforming the graph to have edges weighted by transition probabilities derived from random walks yields better clustering results
• Random walks on graph structures are often used to “flatten” the graph and expose higher-order proximity dependencies that can sometimes look like semantic similarity
• Community Detection algorithms can be used for clustering, and often produce more explainable clusters
End of Case Study #2
Case Study #3
Full paper: https://www.aclweb.org/anthology/P05-1049
Case Study #3: Steps
• Choose an ambiguous word of interest (https://muse.dillfrog.com/lists/ambiguous)
• Find sentences containing the ambiguous word in a large corpus
• Manually assign labels to some sentences
• Featurize each sentence using the POS of neighboring words, unigrams, and local collocations
• Create a graph with sentences as nodes, edges weighted by cosine similarity and JS divergence of feature vectors
• Propagate labels until convergence
• Generate word sense clusters
Case Study #3: Implementation
• We selected the ambiguous word “compound” with these 2 senses:
• Chemical compound
• Composite or multiple
• Extracted 670 sentences containing “compound” from the SD corpus
• Manually marked up 40 sentences total (19 + 21), ~5% of the corpus
• Created a TD matrix of 1..3-grams + 3-gram POS tags; sparsified (k=5), removed self-edges, and created the graph
• Ran Label Propagation to propagate the 40 labels to unlabeled sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/03-word-sense-disambiguation
Case Study #3: Label Propagation
• Label Propagation uses network structure to detect communities
• Used here in a semi-supervised manner, by specifying labels for a small subset of nodes
• Iterative algorithm:
• Initialize each node with a unique label
• Each node updates its label to the most frequent label among its neighbors
• Converges when each node has the most frequent label of its neighbors
• Not guaranteed to converge!
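The semi-supervised variant can be sketched as follows. This is an illustrative pure-Python stand-in (the talk used Neo4j's weighted Label Propagation): seed labels stay fixed, and unlabeled nodes repeatedly adopt the majority label of their neighbors.

```python
import random
from collections import Counter

# Sketch: semi-supervised label propagation -- seed labels are fixed,
# unlabeled nodes repeatedly take the majority label of their neighbors.
def propagate(graph, seeds, iters=50, seed=0):
    rng = random.Random(seed)
    labels = dict(seeds)
    nodes = [u for u in graph if u not in seeds]
    for _ in range(iters):
        rng.shuffle(nodes)                     # random update order
        for u in nodes:
            votes = Counter(labels[v] for v in graph[u] if v in labels)
            if votes:
                labels[u] = votes.most_common(1)[0][0]
    return labels

# two triangles bridged by the edge c-d; two seeds per triangle
g = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
seeds = {"a": "sense1", "b": "sense1", "e": "sense2", "f": "sense2"}
labels = propagate(g, seeds)
```

The bridge nodes "c" and "d" each inherit the sense dominating their own triangle, which is exactly the behavior used to spread the 40 hand-labeled sentence senses across the graph.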
Case Study #3: Results
• Of 623 unlabeled sentences, Label Propagation predicts that 319 sentences use the first sense (chemical compound) and 7 use the second sense (composite), and misses 298
• Misses are mostly chemical compounds (sense 1)
• Examples:
• Sense #1: ORTEP view of the compound [CuL8(ClO4)2] with the numbering scheme adopted.
• Sense #2: Sensitive to compound fluorescence.
• Results can probably be improved – we tried increasing the number of initial labels, and starting with denser networks (so LP does not terminate as quickly)
End of Case Study #3
Case Study #4
Full paper: https://www.aclweb.org/anthology/W09-1126
Case Study #4: Steps
• Build up in-memory graph structure for Knowledge Graph (KG)
• Match phrases in document against KG entries
• Compute Personalized PageRank (PPR) biased to matched nodes
• Roll up top scored concepts from PPR to category concepts
• Report top category concepts as document topics
Case Study #4: Implementation
• Annotate a ScienceDaily article against an Aho-Corasick dictionary of KG concepts
• Use a company-proprietary KG to build the graph, in 2 versions:
• Lateral relations only
• isChildOf (child -> parent) relations only
• Run Personalized PageRank (PPR) against the Lateral Relations graph, setting source nodes to the concepts found in the article
• Roll up high-PPR-score concepts to disease category concepts
• Top disease category concepts are the document topic labels
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/04-topic-identification
Case Study #4: Aho-Corasick Matching
• Inverted index of terms to concept IDs, stored in a trie-like data structure where every node is a token in the phrase representing a concept name
• The document is streamed against this data structure to produce a list of phrases in the document that match concepts in the dictionary
Image Credit: https://brunorb.com/aho-corasick/
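The dictionary-matching step can be sketched with a token-level trie scanned by greedy longest match. Note this is a simplified stand-in for a true Aho-Corasick automaton (which adds failure links for single-pass matching; a real system might use a library such as pyahocorasick); the concept entries are made up.

```python
# Simplified stand-in for the Aho-Corasick step: a token-level trie of
# concept phrases, scanned with greedy longest-match.
def build_trie(concepts):
    trie = {}
    for concept_id, phrase in concepts.items():
        node = trie
        for tok in phrase.lower().split():
            node = node.setdefault(tok, {})
        node["$"] = concept_id          # end-of-phrase marker
    return trie

def match(trie, text):
    tokens = text.lower().split()
    hits, i = [], 0
    while i < len(tokens):
        node, j, last = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$" in node:
                last = (node["$"], i, j)   # longest match so far
        if last:
            hits.append(last)
            i = last[2]                    # skip past the matched phrase
        else:
            i += 1
    return hits

concepts = {"C1": "lung cancer", "C2": "diabetes"}   # hypothetical KG entries
hits = match(build_trie(concepts), "screening for lung cancer saves lives")
```

Each hit is a (concept id, start token, end token) triple; these matched concepts become the source nodes for the Personalized PageRank step.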
Case Study #4: Personalized PageRank
• In PageRank, a surfer doing a random walk on the graph jumps to some random point in the graph with probability d (d = 0.15 for web graphs)
• In Personalized PageRank (PPR), the surfer instead jumps into a neighborhood of the graph specified by a set of nodes (the source nodes)
• The overall effect is to assign high PPR to nodes that are in close proximity to the source nodes
• Personalized PageRank has been found to be an effective measure for recommendation systems
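The only change from plain PageRank is where the teleport lands. A minimal sketch (illustrative, not the Neo4j implementation used in the talk): teleport mass is restricted to the source nodes, so rank concentrates around them.

```python
# Sketch: Personalized PageRank -- identical to PageRank power iteration
# except the teleport lands only on the given source nodes.
def personalized_pagerank(graph, sources, d=0.15, iters=100):
    rank = {u: (1.0 / len(sources) if u in sources else 0.0) for u in graph}
    teleport = {u: (d / len(sources) if u in sources else 0.0) for u in graph}
    for _ in range(iters):
        new = dict(teleport)
        for u, outlinks in graph.items():
            share = (1 - d) * rank[u] / len(outlinks)
            for v in outlinks:
                new[v] += share
        rank = new
    return rank

# toy concept graph: "x" sits near source "a"; "y"/"z" are unreachable
g = {
    "a": ["b"], "b": ["a", "x"], "x": ["b"],
    "y": ["z"], "z": ["y"],
}
ppr = personalized_pagerank(g, sources={"a"})
```

Nodes near the source ("b", "x") get all the rank mass while the disconnected component scores zero, which is exactly the "high PPR in close proximity to the source nodes" effect described above.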
Case Study #4: Disease Categories
• Disease Category concepts are children of the Diseases concept
• Navigate to parents from discovered concepts until a Disease Category node is found (or no parents are found)
• Roll up discovered concepts to their Disease Categories – these are the Document Topics
Case Study #4: Results
Case Study #4: Closing Thoughts
• Topic predictions from rolling up high-PPR concepts are serendipitous, but not necessarily complete
• Results are better if combined with topic predictions obtained from rolling up concepts found directly in the article
End of Case Study #4
Summing up
• Content features and graph structure often reinforce each other
• Can be useful for unsupervised and semi-supervised NLP tasks
• Not necessarily an either-or – BERT-based models can coexist with graph techniques
Tools
• Originally planned to use Spark + GraphFrames for large graphs and Neo4j for small / medium graphs
• Neo4j worked well even for the largest graph (500 K nodes, 1.3 M edges)
• Neo4j algorithms frequently have more functionality:
• Allows multiple source nodes for Personalized PageRank
• Allows weighted edges in Label Propagation
• Ended up using Neo4j for all case studies
Reading List
Thank you
• My contact information
• Email: sujit.pal@elsevier.com
• LinkedIn: https://www.linkedin.com/in/sujitpal/
• Twitter: https://twitter.com/palsujit
• Blog: http://sujitpal.blogspot.com/
• Code for this presentation:
• https://github.com/sujitpal/nlp-graph-examples
