ArXiv Literature Exploration
using Social Network Analysis
Tanat Iempreedee (6210422036)
Yothin Kittithorn (6210422037)
Supalerk Pisitsupakarn (6210422040)
Ratchasit Ngamsa-ardwarit (6210422060)
Business Analytics and Data Science, Applied Statistics, NIDA
TABLE OF CONTENTS
01 INTRODUCTION
02 DATASET
03 ANALYSIS
04 CONCLUSION
INTRODUCTION
01
WHY WE SELECTED THIS PROJECT?
Pain Point
● Searching for research papers is not easy for those who are unfamiliar with the literature.
● For a paper we are studying, we may also want to check the papers that cite it or are cited by it.
● We want to see similar or related papers even when we do not get the search keywords right.
● Which papers should we prioritize first?
Intro
● Exploring the arXiv Citation Network using Social Network Analysis techniques
● Page Rank as the paper importance indicator
● Constructing a Similarity Network from title similarity, followed by Spectral Clustering
● Graph clustering using unsupervised GraphSAGE
DATASET
02
DATASET
ArXiv Dataset
Source : Kaggle
arXiv Dataset (version 4)
● Metadata (1.7+ Million papers, 4.5GB)
ID, Title, Abstract, Created Date, Category
Format: JSON
● Internal Citation (171 MB)
Citations between papers within arXiv only
Format: JSON
(the internal citation data is no longer available)
https://www.kaggle.com/Cornell-University/arxiv
C. B. Clement, M. Bierbaum, K. P. O'Keeffe and A. A. Alemi, “On the Use of ArXiv as a Dataset”, 2019, arXiv:1905.00075 [cs.IR].
Graph Representation
Type: Directed Graph
Node: Paper
Node Attributes: Metadata
Edge: [Paper 1] ⟶ [Cites] ⟶ [Paper 2]
Data Preparation
● Citation - remove self-loops, and remove citations to papers with no metadata available
● Drop isolated nodes (600K), since we want to study the network and these isolated nodes skew averaged statistics such as avg. degree and avg. clustering
Text Preprocessing
● Title and Abstract - remove stop words and normalize text using lemmatization
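The citation clean-up steps above can be sketched in plain Python. This is a minimal sketch, not our actual pipeline: the input layout (paper id mapped to a list of cited ids) and the function name `prepare_citation_edges` are assumptions made for illustration.

```python
def prepare_citation_edges(citations, metadata_ids):
    """Clean raw citations; return (edges, connected_nodes)."""
    edges = set()
    for citing, cited_list in citations.items():
        for cited in cited_list:
            if cited == citing:
                continue  # remove self-loops
            if cited not in metadata_ids:
                continue  # cited paper has no metadata available
            edges.add((citing, cited))
    # Nodes touching at least one remaining edge; everything else is isolated.
    connected_nodes = {node for edge in edges for node in edge}
    return edges, connected_nodes


# Hypothetical toy input: paper id -> list of cited paper ids.
citations = {"A": ["A", "B", "X"], "B": [], "C": []}
metadata_ids = {"A", "B", "C"}

edges, connected_nodes = prepare_citation_edges(citations, metadata_ids)
isolated_nodes = metadata_ids - connected_nodes  # these would be dropped
```

In the toy input, the self-citation of A and the citation to X (no metadata) are removed, and C ends up isolated.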
ANALYSIS
03
Network Statistics
❏ # Nodes: 1,115,865
❏ # Edges: 7,833,188
❏ Density: 6.3e-6
❏ Avg. degree: 14.0397
❏ Avg. clustering coefficient: 0.0823
❏ Largest connected component: 1,005,136
Low Degree
Citations to papers outside arXiv are missing, and there are probably data issues as well. This suggests that our network does not fully capture the real nature of the Citation Network
Low Density, Low Clustering Coefficients
Citations point backward in time: if Paper A (created in 2017) is cited by Paper B (created in 2018), Paper A would not cite Paper B. So the number of edges is small compared with the number of possible edges in the graph
Largest Connected Component
The largest Weakly Connected Component (weakly, since this is a directed graph) is considerably large. This means that knowledge across fields in arXiv is connected in some way.
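As a cross-check, the reported figures follow from the usual directed-graph formulas. The sketch below assumes that density is edges over ordered node pairs and that avg. degree counts in-degree plus out-degree; the variable names are ours.

```python
# Figures reported on the Network Statistics slide.
n_nodes = 1_115_865
n_edges = 7_833_188
largest_wcc = 1_005_136

# Directed graph: each ordered pair of distinct nodes is a possible edge.
density = n_edges / (n_nodes * (n_nodes - 1))

# Each edge adds one out-degree and one in-degree, hence the factor 2.
avg_degree = 2 * n_edges / n_nodes

# Share of nodes inside the largest weakly connected component.
wcc_share = largest_wcc / n_nodes

print(f"density    = {density:.2e}")
print(f"avg degree = {avg_degree:.4f}")
print(f"WCC share  = {wcc_share:.1%}")
```

Under these assumptions the computed density and avg. degree agree with the slide, and roughly 90% of the papers sit in the largest weakly connected component.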
Network Properties (1.1 M papers)
The out-degree is generally lower than the in-degree
Log scale on the Y-axis
Temporal Network Statistics
The Citation Network grows over time, and so do its statistics
*2020/Q2
By iteratively creating incremental subgraphs from the beginning up to each point in time, we compute the network statistics yearly.
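The incremental-subgraph idea can be sketched as follows. This is an illustration under assumptions: each paper carries a creation year, the function name `yearly_network_stats` is hypothetical, and only node count, edge count, avg. degree and density are computed.

```python
def yearly_network_stats(paper_year, edges):
    """Statistics of the subgraph of papers created up to each year."""
    stats = {}
    for cutoff in sorted(set(paper_year.values())):
        # Cumulative subgraph: every paper created in or before the cutoff year.
        nodes = {paper for paper, year in paper_year.items() if year <= cutoff}
        sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
        n, m = len(nodes), len(sub_edges)
        stats[cutoff] = {
            "nodes": n,
            "edges": m,
            "avg_degree": 2 * m / n if n else 0.0,
            "density": m / (n * (n - 1)) if n > 1 else 0.0,
        }
    return stats


# Hypothetical toy data: creation year per paper and (citing, cited) pairs.
paper_year = {"A": 2017, "B": 2018, "C": 2019}
edges = [("B", "A"), ("C", "A"), ("C", "B")]

stats = yearly_network_stats(paper_year, edges)
for year, row in stats.items():
    print(year, row)
```

Re-filtering the full edge list for every year keeps the sketch short; on the full network it would be cheaper to sort edges by year and add them incrementally.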
Page Rank
● Page Rank is used to determine the ranking of a website in a Web Graph
● Since a graph is a universal language, this concept can also be applied to a Citation Network, which is also a directed graph
● Page Rank can represent how important or popular papers are
● Papers with a high Page Rank score are generally cited a lot, and are also cited by other important papers
https://en.wikipedia.org/wiki/PageRank
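To make the idea concrete, here is a minimal power-iteration sketch of Page Rank on a toy citation graph. It is not the implementation we ran: the damping factor 0.85, the fixed iteration count, and the even redistribution of score from papers that cite nothing are assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration Page Rank on (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by papers that cite nothing is spread over all papers.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                # A paper passes its score on, split evenly over its citations.
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical toy graph: A is cited by B and C, B by C, C by D.
edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]
scores = pagerank(edges)
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(paper, round(score, 4))
```

Paper A ranks first because it is cited by B and C, which are themselves cited; D, which nobody cites, ranks last.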
Normalized Page Rank
In order to compare Page Rank across years, we use normalized Page Rank
to create Page Rank over Time statistics
K. Berberich, S. Bedathur, G. Weikum, “Normalized Page Rank for Evolving Graphs”, Max-Planck Institute for Informatics, Saarbrücken,
https://people.mpi-inf.mpg.de/~kberberi/presentations/2007-www2007.pdf
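As we read Berberich et al., the normalization divides each score by the score that a node without incoming links would receive, so the smallest normalized value is 1 regardless of graph size. The sketch below follows that reading; the function name, the parameters, and the damping factor 0.85 are assumptions for illustration.

```python
def normalize_pagerank(rank, out_degree, damping=0.85):
    """Divide Page Rank scores by the score of a node with no in-links."""
    n = len(rank)
    # Score currently held by papers that cite nothing (dangling nodes).
    dangling_mass = sum(
        score for node, score in rank.items() if out_degree.get(node, 0) == 0
    )
    # A node nobody cites receives only the random-jump share plus its share
    # of the redistributed dangling score.
    lower_bound = ((1 - damping) + damping * dangling_mass) / n
    return {node: score / lower_bound for node, score in rank.items()}


# Hypothetical two-paper graph B -> A with its exact stationary Page Rank
# (damping 0.85, dangling score spread evenly over all nodes).
rank = {"A": 0.925 / 1.425, "B": 0.5 / 1.425}
out_degree = {"A": 0, "B": 1}

normalized = normalize_pagerank(rank, out_degree)
print(normalized)
```

Here B is cited by nobody and gets a normalized score of 1, while A gets 1.85; thresholds such as 3.5 can then be compared across yearly subgraphs of different sizes.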
Page Rank over Time
(All Papers)
For each publication year, it takes at least 17 years for the average Page Rank to exceed 3.5
Cohort Analysis
Page Rank over Time
(cs.SI)
In the Social and Information Networks (cs.SI) field, the Page Rank of papers published between Y’14 and Y’17 takes only 3 - 6 years to exceed 3.5
This suggests that some papers become significantly more popular after publication
● 2014 : CNN, RNN
● 2015 : CNN, NN
● 2016 : NN,
● 2017 : Adam, CNN, GAN
New Old
Top 5 Page Rank over Time (All CS)
However, average Page Rank is sensitive to outliers
Title Similarity Network and Community
Nodes = Papers
Edges = Similarity between papers
Text preprocessing
● Lower case
● Remove punctuation
● Remove stopwords
● Lemmatization
● Bag of words
● TFIDF
Pairwise Cosine Similarity
Output result
Adam: A Method for Stochastic Optimization
Title Similarity Network and Community (2)
Nodes Edge
Filter Cosine >= 0.7
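The edge construction can be sketched in plain Python: TFIDF vectors per title, pairwise cosine similarity, and an edge wherever the similarity is at least 0.7. This is a minimal sketch with assumptions: the titles are hypothetical and already preprocessed, and the IDF variant (log(N/df) + 1) is our choice here, not necessarily the one we used.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(token_lists):
    """Sparse TFIDF vector (dict token -> weight) for each title."""
    n_docs = len(token_lists)
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append(
            {t: c * (math.log(n_docs / doc_freq[t]) + 1.0) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(titles, threshold=0.7):
    """Return (i, j, similarity) for title pairs at or above the threshold."""
    vectors = tfidf_vectors([title.lower().split() for title in titles])
    edges = []
    for i, j in combinations(range(len(titles)), 2):
        similarity = cosine(vectors[i], vectors[j])
        if similarity >= threshold:
            edges.append((i, j, similarity))
    return edges


# Hypothetical, already preprocessed titles.
titles = [
    "stochastic optimization method",
    "method stochastic optimization deep",
    "graph attention network",
]
edges = similarity_edges(titles)
print(edges)
```

Comparing all pairs is quadratic in the number of titles, so this only illustrates the filter; a large collection needs sparse matrix products or blocking.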
Title Similarity Network and Community (3)
182 Communities, but most of them are isolated communities
Filter: number of nodes in community >= 10
10 Communities
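The size filter can be sketched as below. The slide does not show which community detection step produced the 182 communities, so this sketch uses connected components of the similarity network purely as a stand-in; the function name is hypothetical.

```python
from collections import defaultdict, deque


def communities_by_component(edges, min_size=10):
    """Connected components of an undirected graph, kept if large enough."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    communities = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects one component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) >= min_size:
            communities.append(component)
    return communities


# Hypothetical toy network: a chain of 12 papers plus one isolated pair.
edges = [(i, i + 1) for i in range(11)] + [(100, 101)]
kept = communities_by_component(edges, min_size=10)
print(len(kept), [len(c) for c in kept])
```

The isolated pair is filtered out and only the 12-paper group remains, mirroring how small isolated communities are discarded.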
Community Interpretation with LDA
Topic modeling by iterating an LDA model over each community
Grouping
Graph Clustering - End-to-end process
GraphSAGE
http://snap.stanford.edu/graphsage/
W.L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs”, 2017, arXiv:1706.02216 [cs.SI]
GraphSAGE Implementation
StellarGraph Machine Learning Library
https://stellargraph.readthedocs.io/
Unsupervised Sampler
Sampling: node pairs labeled Positive / Negative, sampled equally
Train: Node Pair Classifier
Label: whether the node pair co-occurs in
random walks of the graph
https://stellargraph.readthedocs.io/en/stable/demos/embeddings/graphsage-unsupervised-sampler-embeddings.html
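The labeling scheme can be sketched without the library. This is our own simplified illustration, not StellarGraph's code: positives are (start, visited) pairs from short random walks, and each positive is matched with one negative whose target is drawn uniformly at random, which is an assumption of this sketch.

```python
import random


def sample_node_pairs(neighbors, walk_length=5, walks_per_node=1, seed=0):
    """Return (node, other, label) triples with equal positives and negatives."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    pairs = []
    for start in nodes:
        for _ in range(walks_per_node):
            current = start
            # A walk of length 5 visits the start plus up to 4 further nodes.
            for _ in range(walk_length - 1):
                if not neighbors[current]:
                    break
                current = rng.choice(neighbors[current])
                pairs.append((start, current, 1))  # co-occurs in the walk
                pairs.append((start, rng.choice(nodes), 0))  # random negative
    return pairs


# Hypothetical toy graph as undirected adjacency lists.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
pairs = sample_node_pairs(graph)
positives = sum(label for _, _, label in pairs)
print(len(pairs), positives)
```

With 3 nodes, 1 walk per node and 4 steps per walk, the sketch yields 12 positive and 12 negative pairs, which the node pair classifier would then be trained on.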
Unsupervised GraphSAGE
Train: GraphSAGE Encoder (graph structure + node features, for each node of a pair) + Node Pair Classification (0/1)
Embedding Model: graph structure + node features of all nodes ⟶ Node Embeddings (50 dimensions)
Model Training and Embedding
Training on Machine Learning papers (40,635 nodes) using a basic parameter setup
Epochs: 20
Elapsed time: 4-5 hours
Unfortunately, the loss did not budge.
There is a lot to improve, but we do not have a proper environment at the moment.
Lesson learned: get a GPU!
Choosing K
K-Means vs Mini Batch K-Means
Clustering 40K embedded papers with 50 features each
Mini Batch K-Means: 0:00:58
K-Means: 0:12:38
D. Sculley, “Web-Scale K-Means Clustering”, Google, Inc., PA, USA, https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
To help select K using a scree plot, we can use MiniBatch K-Means and a polynomial fit to approximate the SSE within a given K range. It turns out to be faster (as expected), and the result seems close.
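The mini-batch update and the SSE-per-K scan can be sketched from scratch. This is an illustration only, not the library call we timed: it takes the first k points as initial centers to stay deterministic, uses toy 2-D points instead of 50-dimensional embeddings, and leaves out the polynomial fit.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mini_batch_kmeans(points, k, batch_size=20, iterations=50, seed=0):
    """Mini-batch K-Means with a per-center learning rate of 1 / count."""
    rng = random.Random(seed)
    centers = [list(p) for p in points[:k]]  # simplification: first k points
    counts = [0] * k
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then move the centers.
        nearest = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in batch
        ]
        for p, c in zip(batch, nearest):
            counts[c] += 1
            rate = 1.0 / counts[c]
            centers[c] = [(1 - rate) * old + rate * x for old, x in zip(centers[c], p)]
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data: two well-separated groups, interleaved.
points = []
for i in range(50):
    points.append((0.01 * i, 0.0))
    points.append((10.0 + 0.01 * i, 10.0))

sse_by_k = {k: sse(points, mini_batch_kmeans(points, k)) for k in (1, 2, 3)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

Plotting `sse_by_k` against K gives the scree plot; with two groups in the toy data, the SSE drops sharply from K=1 to K=2.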
Machine Learning
Papers
40,635 Papers
Node Features
- TFIDF from Title + Abstract
(top 2000 words)
# Random Walk: 1
Random Walk Length: 5
NN layer: [50,50]
Embedding: 50 dimensions
K-Means: 10 clusters
Bubble size = Page Rank
Machine Learning
Papers
Overlay with markers for the Top 50 Page Rank scores
The Adam optimizer paper, which has the highest Page Rank score, is located in Cluster 7 together with several other Top 50 papers
Experiment with Node Features
Panels: BOW vs. TFIDF
Social and Information Network Papers
Let’s have a look at the papers in cs.SI, which is directly related to this subject
The paper with the 2nd highest Page Rank score, Graph Attention Networks, is in there; we may want to explore what’s inside that cluster further
CONCLUSION
04
Conclusion
● Using Social Network Analysis can enrich the literature search
● One of the good traits of a graph is that it is a “Universal Language”
For the same data, we can generate different types of networks depending on how we define the “relationships”
Future Work
● Incorporating more NLP techniques
● Model tuning, or using different models, e.g. Graph Attention Networks
● Navigating the Citation Network through a graphical, interactive UI would be ideal for students looking for research topics and literature reviews
THANK YOU
