ArXiv Literature Exploration using Social Network Analysis
Tanat Iempreedee (6210422036)
Yothin Kittithorn (6210422037)
Supalerk Pisitsupakarn (6210422040)
Ratchasit Ngamsa-ardwarit (6210422060)
Business Analytics and Data Science, Applied Statistics, NIDA
4. WHY WE SELECTED THIS PROJECT?
Pain Point
● Searching for research papers is not easy for those who are not familiar with a field.
● For a paper we are studying, we may also want to check the papers that cite it or are
cited by it.
● We want to see similar or related papers even when our search keywords are not quite right.
● Which papers should we prioritize first?
Intro
● Exploring the arXiv citation network using Social Network Analysis techniques
● PageRank as the paper importance indicator
● Constructing a similarity network from title similarity, then applying
Spectral Clustering
● Graph clustering using unsupervised GraphSAGE
6. DATASET
ArXiv Dataset
Source : Kaggle
arXiv Dataset (version 4)
● Metadata (1.7+ Million papers, 4.5GB)
ID, Title, Abstract, Created Date, Category
Format: JSON
● Internal Citations (171 MB)
Citations between papers that are both on arXiv
Format: JSON
(the internal citation data is no longer available)
https://www.kaggle.com/Cornell-University/arxiv
C. B. Clement, M. Bierbaum, K. P. O'Keeffe and A. A. Alemi, “On the Use of ArXiv as a Dataset”, 2019, arXiv:1905.00075 [cs.IR].
8. Data Preparation
● Citations - remove self-loops, and remove citations to papers with no metadata
available
● Drop isolated nodes (600K), since we want to study the network and isolated
nodes distort averaged statistics such as average degree and average clustering coefficient
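The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the code used for the project: the paper IDs are made up, and dropping an edge when either endpoint lacks metadata is an assumption (the slide only mentions the cited side).

```python
def clean_citation_edges(edges, papers_with_metadata):
    """Return citation edges without self-loops or endpoints lacking metadata."""
    return [
        (citing, cited)
        for citing, cited in edges
        if citing != cited                      # drop self-loops
        and citing in papers_with_metadata      # assumption: check both endpoints
        and cited in papers_with_metadata
    ]


def connected_papers(edges):
    """Papers that appear in at least one edge; every other paper is isolated."""
    return {paper for edge in edges for paper in edge}


# Hypothetical toy data (IDs are made up for illustration).
metadata_ids = {"p1", "p2", "p3", "p4"}
raw_edges = [("p1", "p2"), ("p2", "p2"), ("p3", "p9"), ("p1", "p3")]

edges = clean_citation_edges(raw_edges, metadata_ids)
kept = connected_papers(edges)
isolated = metadata_ids - kept   # papers to drop before computing statistics
```

In this toy example the self-loop on p2 and the citation to the unknown paper p9 are removed, which leaves p4 isolated.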
Text Preprocessing
● Title and Abstract - remove stop words and normalize text using lemmatization
10. Network Statistics
❏ # Nodes: 1,115,865
❏ # Edges: 7,833,188
❏ Density: 6.3e-6
❏ Avg. degree: 14.0397
❏ Avg. clustering coefficient: 0.0823
❏ Largest connected component: 1,005,136
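As a quick sanity check, the density and average degree can be recomputed from the node and edge counts. The sketch assumes the directed-graph density E / (N(N-1)) and an average degree that counts in-degree plus out-degree (2E / N); these definitions are our reading of the numbers, not stated on the slide.

```python
n_nodes = 1_115_865
n_edges = 7_833_188
largest_wcc = 1_005_136

# Directed graph: N * (N - 1) possible edges.
density = n_edges / (n_nodes * (n_nodes - 1))

# Each edge contributes one out-degree and one in-degree.
avg_degree = 2 * n_edges / n_nodes

# Share of papers inside the largest connected component.
wcc_share = largest_wcc / n_nodes

print(f"density    = {density:.2e}")
print(f"avg degree = {avg_degree:.4f}")
print(f"WCC share  = {wcc_share:.1%}")
```

Under these definitions the results agree with the reported density (about 6.3e-6) and average degree (14.0397), and the largest component holds roughly 90% of the papers.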
Low Degree
Citations to papers that are not on arXiv are
missing, and there are probably data issues as
well. This suggests that our network does not
fully capture the real citation network.
Low Density, Low Clustering Coefficients
If Paper A, created in 2017, is cited by Paper B,
created in 2018, Paper A cannot cite Paper B
back. So the number of edges is low compared
to the number of possible edges in the graph.
Largest Connected Component
The largest Weakly Connected Component
(weak, since this is a directed graph) is
considerably large. This means that knowledge
across fields in arXiv is connected in some
way.
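A minimal sketch of how the size of the largest weakly connected component can be found: ignore edge direction and run a breadth-first search from every unvisited paper. The function name and toy edges are illustrative only.

```python
from collections import defaultdict, deque


def largest_weakly_connected_component(edges):
    """Size of the largest component when edge direction is ignored."""
    neighbors = defaultdict(set)
    for citing, cited in edges:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)   # treat the citation as undirected

    seen = set()
    largest = 0
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        largest = max(largest, size)
    return largest


toy_edges = [("a", "b"), ("c", "b"), ("d", "e")]
largest = largest_weakly_connected_component(toy_edges)
```

Here a, b and c form one component even though no directed path leads from a to c, which is exactly what "weakly" connected means.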
11. Network Properties (1.1 M papers)
The out-degree is generally lower than the in-degree
(Log scale on the Y-axis)
12. Temporal Network Statistics
The citation network grows over time, and so do its statistics
*2020/Q2
By iteratively creating incremental subgraphs from
the beginning up to a given point in time, we compute
the network statistics yearly.
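The incremental-subgraph idea can be sketched as follows. This is an illustration under assumptions: each paper has a creation year, an edge enters the snapshot once both of its papers exist, and isolated papers are left out, as in the data preparation step. The names and toy data are hypothetical.

```python
def yearly_snapshots(paper_year, edges):
    """Yield (year, n_nodes, n_edges, avg_degree) for cumulative subgraphs."""
    for year in sorted(set(paper_year.values())):
        # Keep an edge only when both papers exist by the end of `year`.
        sub_edges = [
            (u, v) for u, v in edges
            if paper_year[u] <= year and paper_year[v] <= year
        ]
        nodes = {paper for edge in sub_edges for paper in edge}
        avg_degree = 2 * len(sub_edges) / len(nodes) if nodes else 0.0
        yield year, len(nodes), len(sub_edges), avg_degree


# Hypothetical toy data: (citing, cited) pairs and creation years.
paper_year = {"a": 2017, "b": 2018, "c": 2019}
edges = [("b", "a"), ("c", "a"), ("c", "b")]

snapshots = list(yearly_snapshots(paper_year, edges))
```

Rebuilding every snapshot from scratch keeps the sketch short; on a graph with millions of edges it would be cheaper to sort the edges by year once and add them incrementally.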
13. Page Rank
● PageRank is used to determine the ranking of
a website in a web graph
● Since a graph is a universal language, this
concept can also be applied to a citation network,
which is a directed graph as well
● PageRank can represent how important or
popular papers are
● Papers with a high PageRank score are
generally cited a lot, and are also cited by other
important papers
https://en.wikipedia.org/wiki/PageRank
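To make the idea concrete, here is a minimal power-iteration sketch of PageRank on a citation graph. It assumes a damping factor of 0.85 and spreads the score of papers that cite nothing evenly over all papers; both are common choices, not settings confirmed by this deck.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank for (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Score held by papers that cite nothing is spread over all papers.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each paper passes its score, split evenly, to the papers it cites.
        for citing in nodes:
            targets = out_links[citing]
            for cited in targets:
                new_rank[cited] += damping * rank[citing] / len(targets)
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph: A is cited by B, C and D; B is cited by D.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(edges)
top_paper = max(scores, key=scores.get)
```

Paper A collects score from three citing papers, so it ends up with the highest PageRank in this toy graph.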
14. Normalized Page Rank
To compare PageRank across years, we use normalized PageRank
to build the PageRank-over-time statistics
K. Berberich, S. Bedathur, G. Weikum, “Normalized Page Rank for Evolving Graphs”, Max-Planck Institute for Informatics, Saarbrücken,
https://people.mpi-inf.mpg.de/~kberberi/presentations/2007-www2007.pdf
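A minimal sketch of the normalization, assuming it divides every score by the lowest score a paper can get, namely that of a paper with no incoming citations. The formula below is our reading of the cited work and should be checked against it; the function name and the damping value are assumptions.

```python
def normalize_pagerank(scores, dangling_nodes, damping=0.85):
    """Divide each score by the score of a hypothetical uncited paper."""
    n = len(scores)
    # Score held by papers that cite nothing (redistributed uniformly).
    dangling_mass = sum(scores[node] for node in dangling_nodes)
    lower_bound = ((1 - damping) + damping * dangling_mass) / n
    return {node: score / lower_bound for node, score in scores.items()}


# Exact PageRank of the two-paper graph a -> b with damping 0.85.
scores = {"a": 0.5 / 1.425, "b": 0.925 / 1.425}
normalized = normalize_pagerank(scores, dangling_nodes={"b"})
```

Because the lower bound depends on the graph size, a normalized score of 1 always means "no citations at all", which makes snapshots of different sizes comparable. In the toy graph, paper a is uncited and gets 1.0, while paper b gets 1.85.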
15. Page Rank over Time
(All Papers)
For the papers of a given publication year to reach an
average PageRank greater than 3.5, it takes at least 17 years
Cohort Analysis
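The cohort view can be sketched as a group-by over (publication year, snapshot year). The record layout, function name and toy values below are hypothetical; only the 3.5 threshold comes from the slide.

```python
from collections import defaultdict


def years_to_reach(records, threshold=3.5):
    """records: (published_year, snapshot_year, normalized_pagerank) tuples.

    Returns {published_year: years until the cohort average first
    exceeds the threshold}; cohorts that never reach it are omitted.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for published, snapshot, score in records:
        cell = totals[(published, snapshot)]
        cell[0] += score
        cell[1] += 1

    result = {}
    # Sorted order guarantees the earliest qualifying snapshot is kept.
    for published, snapshot in sorted(totals):
        total, count = totals[(published, snapshot)]
        if published not in result and total / count > threshold:
            result[published] = snapshot - published
    return result


# Hypothetical toy records.
records = [
    (2014, 2015, 1.2), (2014, 2015, 2.0),
    (2014, 2017, 3.0), (2014, 2017, 5.0),
    (2014, 2018, 6.0),
    (2010, 2012, 1.0),
]
cohort_years = years_to_reach(records)
```

In the toy data the 2014 cohort averages 1.6 in 2015 and 4.0 in 2017, so it crosses the threshold after 3 years, while the 2010 cohort never does.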
16. Page Rank over Time
(cs.SI)
In the Social and Information Networks (cs.SI) field,
the average PageRank of papers published between 2014 and
2017 takes only 3 - 6 years to exceed 3.5
This suggests that some papers become highly popular
after publication
● 2014 : CNN, RNN
● 2015 : CNN, NN
● 2016 : NN,
● 2017 : Adam, CNN, GAN
17. Top 5 Page Rank over Time (All CS)
However, average PageRank is sensitive to outliers
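A small illustration of that sensitivity, using made-up normalized PageRank values for one cohort. Comparing the mean with the median is our suggestion for spotting the effect, not something reported in the slides.

```python
from statistics import mean, median

# Hypothetical cohort: four ordinary papers and one extremely popular one.
cohort_scores = [1.0, 1.1, 1.2, 1.3, 40.0]

cohort_mean = mean(cohort_scores)      # pulled up by the single outlier
cohort_median = median(cohort_scores)  # stays close to the typical paper
```

A single highly cited paper lifts the cohort mean above the 3.5 threshold even though the typical paper in the cohort is far below it.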
18. Title Similarity Network and Community
Nodes = Papers
Edges = Similarity between papers
Text preprocessing
● Lower case
● Remove punctuation
● Remove stopwords
● Lemmatization
● Bag of words
● TFIDF
Pairwise Cosine Similarity
Output result
Adam: A Method for Stochastic Optimization
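The pipeline above can be sketched with the standard library only. Several parts are assumptions made for illustration: the tiny stop-word list, the smoothed IDF formula, the 0.3 similarity threshold, and the omission of lemmatization (which would need an NLP library). The second and third titles are used only as sample input.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (a real one would be much longer).
STOPWORDS = {"a", "an", "the", "for", "of", "and", "in", "on", "with"}


def tokenize(title):
    """Lower-case, strip punctuation, drop stop words (no lemmatization)."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]


def tfidf_vectors(titles):
    """Bag of words per title, weighted by a smoothed IDF (assumed formula)."""
    docs = [Counter(tokenize(title)) for title in titles]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return [{w: count * idf[w] for w, count in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_edges(titles, threshold=0.3):
    """Connect two papers when their title similarity reaches the threshold."""
    vectors = tfidf_vectors(titles)
    edges = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges


titles = [
    "Adam: A Method for Stochastic Optimization",
    "A Stochastic Optimization Method for Deep Networks",
    "Graph Attention Networks",
]
edges = similarity_edges(titles)
```

Only the first two titles share enough weighted words to be linked. The all-pairs loop is quadratic, so a corpus of this size would need sparse matrices or a nearest-neighbor search instead.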
20. Title Similarity Network and Community (3)
182 Communities
but most of them are isolated communities
Filter: number of nodes in community >= 10
10 Communities
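The size filter can be sketched as a simple count over community labels. The label mapping and function name are hypothetical; only the minimum size of 10 comes from the slide.

```python
from collections import Counter


def large_communities(labels, min_size=10):
    """labels: {paper_id: community_id}. Keep communities with >= min_size papers."""
    sizes = Counter(labels.values())
    return {community for community, size in sizes.items() if size >= min_size}


# Hypothetical labels: one community of 12 papers, one of 2, one of 1.
labels = {f"p{i}": 0 for i in range(12)}
labels.update({"q1": 1, "q2": 1, "r1": 2})

kept_communities = large_communities(labels)
```

Only community 0 survives the filter in this toy example.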
26. Unsupervised GraphSAGE
Train: node pairs (graph structure + node features for each node) -> GraphSAGE Encoder (50 dimensions) -> Node Pair Classification (0/1)
Embedding Model: all nodes (graph structure + node features) -> Node Embeddings (50 dimensions)
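The 0/1 node-pair labels used for training can be sketched as follows, assuming the usual unsupervised GraphSAGE recipe: pairs that co-occur on a short random walk are positives (1) and randomly drawn pairs are negatives (0). The function name, the one-negative-per-positive ratio and the toy graph are assumptions; the encoder itself is not shown.

```python
import random


def sample_training_pairs(neighbors, walk_length=5, walks_per_node=1, seed=0):
    """Return (node_a, node_b, label) triples for node-pair classification."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    pairs = []
    for start in nodes:
        for _ in range(walks_per_node):
            current = start
            # A walk of `walk_length` nodes takes walk_length - 1 steps.
            for _ in range(walk_length - 1):
                if not neighbors[current]:
                    break
                current = rng.choice(neighbors[current])
                pairs.append((start, current, 1))            # positive pair
                pairs.append((start, rng.choice(nodes), 0))  # negative pair
    return pairs


# Hypothetical undirected toy graph as adjacency lists.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
pairs = sample_training_pairs(neighbors)
```

A randomly drawn negative can occasionally be a true neighbor. That noise is usually tolerated on large graphs, but it is worth keeping in mind on small ones.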
27. Model Training and Embedding
Trained on Machine Learning
papers (40,635 nodes)
with a basic parameter setup
Epochs: 20
Elapsed Time: 4-5 hours
Unfortunately, the loss did not even budge.
There are a lot of things to improve, but we do not
have a proper environment at the moment.
Lesson learned: get a GPU!
28. Choosing K
K-Means vs Mini Batch K-Means
Clustering the 40K embedded papers, with 50 features each
Mini Batch K-Means: 0:00:58
K-Means: 0:12:38
D. Sculley, “Web-Scale K-Means Clustering”, Google, Inc., PA, USA, https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
To help select K with a scree plot, we can use
Mini Batch K-Means and a polynomial fit to approximate the SSE
within a given range of K. It is faster, as expected, and the
result seems close.
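A minimal sketch of the Mini Batch K-Means update, in the spirit of the cited paper: each center moves toward the points assigned to it with a learning rate of 1 / (points seen so far). The batch size, iteration count and toy data are assumptions, and the polynomial fit of the SSE curve is left out.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def mini_batch_kmeans(points, k, batch_size=10, n_iter=50, seed=0):
    """Mini-batch K-Means with per-center learning rates."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Cache assignments for the whole batch before moving any center.
        assigned = [nearest(p, centers) for p in batch]
        for point, idx in zip(batch, assigned):
            counts[idx] += 1
            rate = 1.0 / counts[idx]
            centers[idx] = [
                (1 - rate) * c + rate * x for c, x in zip(centers[idx], point)
            ]
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(squared_distance(p, centers[nearest(p, centers)]) for p in points)


# Toy data: two well-separated groups on a line.
points = [(i * 0.1, 0.0) for i in range(10)] + [(10 + i * 0.1, 0.0) for i in range(10)]
sse_by_k = {k: sse(points, mini_batch_kmeans(points, k)) for k in range(1, 5)}
```

Plotting `sse_by_k` against K gives the scree plot. With two clear groups, the SSE drops sharply from K = 1 to K = 2 and then flattens.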
29. Machine Learning Papers
40,635 Papers
Node Features
- TF-IDF from Title + Abstract
(top 2,000 words)
Number of random walks: 1
Random walk length: 5
NN layers: [50, 50]
Embedding: 50 dimensions
K-Means: 10 clusters
Bubble size = PageRank
30. Machine Learning Papers
Overlaid with markers for the top 50 papers
by PageRank score
The Adam optimizer paper, which has the highest
PageRank score, is located in
Cluster 7 together with several other
top-50 papers
32. Social and Information Networks Papers
Let’s have a look at the papers in cs.SI,
which is directly related to this subject
The paper with the second-highest PageRank score,
Graph Attention Networks, is located there;
we may want to explore what’s inside
that cluster further
34. Conclusion
● Social Network Analysis can enrich the literature search
● One of the good traits of a graph is that it is a “universal language”
For the same data, we can generate different types of networks depending on
how we define the “relationships”
Future Work
● Incorporating more NLP techniques
● Model tuning, or using different models, e.g. Graph Attention Networks
● A graphical, interactive UI for navigating the citation network
would be ideal for students looking for research topics and
doing literature reviews