ArXiv Literature Exploration using Social Network Analysis

Tanat Iempreedee (6210422036)
Yothin Kittithorn (6210422037)
Supalerk Pisitsupakarn (6210422040)
Ratchasit Ngamsa-ardwarit (6210422060)
Business Analytics and Data Science, Applied Statistics, NIDA

TABLE OF CONTENTS
01 INTRODUCTION
02 DATASET
03 ANALYSIS
04 CONCLUSION

WHY WE SELECTED THIS PROJECT?
Pain Point
● Searching for research papers is not easy for those unfamiliar with it.
● For a paper we are studying, we may also want to check the papers that
cite it or are cited by it.
● We want to see similar or related papers even if we do not get the search keywords right.
● Which paper should we prioritize first?
Intro
● Exploring the arXiv citation network using Social Network Analysis techniques
● PageRank as an indicator of paper importance
● Constructing a similarity network from title similarity, followed by
spectral clustering
● Graph clustering using unsupervised GraphSAGE

DATASET
ArXiv Dataset
Source : Kaggle
arXiv Dataset (version 4)
● Metadata (1.7+ million papers, 4.5 GB)
Fields: ID, Title, Abstract, Created Date, Category
Format: JSON
● Internal Citations (171 MB)
Citations between arXiv papers only
Format: JSON
(the internal citation data is no longer available)
https://www.kaggle.com/Cornell-University/arxiv
C. B. Clement, M. Bierbaum, K. P. O'Keeffe and A. A. Alemi, “On the Use of ArXiv as a Dataset”, 2019, arXiv:1905.00075 [cs.IR].

Graph Representation
Type: Directed Graph
Node: Paper
Node Attributes: Metadata
Edge: [Paper 1] ⟶ [Cites] ⟶ [Paper 2]

Data Preparation
● Citations: remove self-loops and citations to papers with no metadata
available
● Drop isolated nodes (600K), since we want to study the network and isolated
nodes skew averaged statistics such as avg. degree and avg. clustering coefficient
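The cleaning steps above can be sketched in plain Python. This is a minimal illustration and not the code used for the project: the function name, the toy paper IDs, and the dict-of-lists input shape are assumptions, and the sketch drops an edge when either endpoint lacks metadata.

```python
def clean_citations(citations, papers_with_metadata):
    """Return (nodes, edges) after removing self-loops, edges that touch
    papers without metadata, and isolated nodes."""
    edges = set()
    for citing, cited_list in citations.items():
        for cited in cited_list:
            if citing == cited:
                continue  # self-loop
            if citing not in papers_with_metadata or cited not in papers_with_metadata:
                continue  # an endpoint has no metadata
            edges.add((citing, cited))
    # Papers that appear in no surviving edge are isolated and are dropped.
    nodes = {paper for edge in edges for paper in edge}
    return nodes, edges


# Toy data with hypothetical IDs.
citations = {
    "A": ["A", "B", "X"],  # a self-loop, a valid citation, a paper without metadata
    "B": ["C"],
    "C": [],
    "D": [],               # isolated paper
}
metadata_ids = {"A", "B", "C", "D"}
nodes, edges = clean_citations(citations, metadata_ids)
print(sorted(nodes), sorted(edges))
```

Here only the edges A ⟶ B and B ⟶ C survive, so paper D is dropped as isolated.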
Text Preprocessing
● Title and abstract: remove stop words and normalize text using lemmatization

Network Statistics
❏ # Nodes: 1,115,865
❏ # Edges: 7,833,188
❏ Density: 6.3e-6
❏ Avg. degree: 14.0397
❏ Avg. clustering coefficient: 0.0823
❏ Largest connected component: 1,005,136
Low Degree
Citations to papers outside arXiv are missing, and there are
probably data issues as well. This suggests that our network
does not fully capture the real citation network.
Low Density, Low Clustering Coefficient
Citations point backward in time: if Paper A (2017) is cited by
Paper B (2018), Paper A cannot cite Paper B. The number of edges
is therefore small compared with the number of possible edges in the graph.
Largest Connected Component
The largest weakly connected component (weakly, since this is a
directed graph) covers about 90% of the nodes. This means that
knowledge across fields in arXiv is connected in some way.
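As a quick sanity check, the reported density and average degree can be recomputed from the node and edge counts. The sketch assumes the usual directed-graph definitions: density = m / (n(n - 1)), and average degree counting in-degree plus out-degree, 2m / n. The deck does not state which definitions were used, but the reported figures match these.

```python
n_nodes = 1_115_865
n_edges = 7_833_188
largest_wcc = 1_005_136

density = n_edges / (n_nodes * (n_nodes - 1))   # directed graph, no self-loops
avg_degree = 2 * n_edges / n_nodes              # in-degree + out-degree per node
wcc_share = largest_wcc / n_nodes               # share of nodes in the largest WCC

print(f"density           = {density:.1e}")
print(f"avg. degree       = {avg_degree:.4f}")
print(f"largest WCC share = {wcc_share:.1%}")
```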

Network Properties (1.1 M papers)
The out-degree is generally lower than the in-degree
(Log scale on the Y-axis)

Temporal Network Statistics
The citation network grows over time, and so do its statistics
*2020/Q2
By iteratively building cumulative subgraphs from the
beginning up to a given point in time, we compute the
network statistics year by year.
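The cumulative-subgraph idea can be sketched as follows. This is an illustration under assumptions: the function name, the input shapes (a paper-to-year dict and a list of citing/cited pairs), and the choice of statistics are hypothetical, and a real run would use a graph library rather than rescanning all edges each year.

```python
def yearly_statistics(paper_year, edges):
    """Cumulative node count, edge count and average degree per year.

    paper_year: dict mapping paper ID -> creation year
    edges: list of (citing, cited) pairs
    """
    stats = {}
    for year in sorted(set(paper_year.values())):
        # Subgraph induced by all papers created up to and including `year`.
        nodes = {p for p, y in paper_year.items() if y <= year}
        sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
        n, m = len(nodes), len(sub_edges)
        stats[year] = {
            "nodes": n,
            "edges": m,
            "avg_degree": 2 * m / n if n else 0.0,
        }
    return stats


# Toy data with hypothetical IDs and years.
paper_year = {"A": 2017, "B": 2018, "C": 2018, "D": 2019}
edges = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]
stats = yearly_statistics(paper_year, edges)
for year, row in stats.items():
    print(year, row)
```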

Page Rank
● PageRank is used to determine the ranking of
a website in a web graph
● Since a graph is a universal language, this
concept can also be applied to a citation network,
which is a directed graph as well
● PageRank can represent how important or
popular a paper is
● Papers with a high PageRank score are
generally cited often, and cited by other
important papers
https://en.wikipedia.org/wiki/PageRank
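For reference, the standard PageRank recurrence can be written as below. The symbols are the usual ones and not specific to this project: d is the damping factor (commonly 0.85), N is the number of papers, M(p_i) is the set of papers citing p_i, and L(p_j) is the number of citations made by p_j.

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```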

Normalized Page Rank
In order to compare Page Rank across years, we use normalized Page Rank
to create Page Rank over Time statistics
K. Berberich, S. Bedathur, G. Weikum, “Normalized Page Rank for Evolving Graphs”, Max-Planck Institute for Informatics, Saarbrücken,
https://people.mpi-inf.mpg.de/~kberberi/presentations/2007-www2007.pdf
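A minimal sketch of the normalization, under a stated assumption: each score is divided by the lower-bound score that a node with no incoming citations receives, which is how we read the reference above. The function names, the damping value, and the toy graph are illustrative, not the project's code.

```python
def pagerank(out_links, damping=0.85, iterations=200):
    """Power-iteration PageRank; the mass of dangling nodes is spread uniformly."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


def normalized_pagerank(out_links, damping=0.85):
    """Divide each score by the score of a node that nobody cites (assumed lower bound)."""
    rank = pagerank(out_links, damping)
    n = len(rank)
    dangling = sum(r for v, r in rank.items() if not out_links.get(v))
    lower_bound = ((1.0 - damping) + damping * dangling) / n
    return {v: r / lower_bound for v, r in rank.items()}


# Toy citation graph (hypothetical): B and C cite A, D cites B and C.
citations = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
scores = normalized_pagerank(citations)
for paper, score in sorted(scores.items()):
    print(paper, round(score, 3))
```

Under this normalization an uncited paper scores about 1 regardless of graph size, which is what makes scores comparable across yearly snapshots of different sizes.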

Page Rank over Time
(All Papers)
For each publication year, it takes at least 17 years
for the average PageRank to exceed 3.5
Cohort Analysis

Page Rank over Time
(cs.SI)
In the Social and Information Networks (cs.SI) field,
the PageRank of papers published between 2014 and
2017 takes only 3 to 6 years to rise above 3.5.
This suggests that some papers gain popularity
quickly after publication
● 2014 : CNN, RNN
● 2015 : CNN, NN
● 2016 : NN,
● 2017 : Adam, CNN, GAN

Top 5 Page Rank over Time (All CS)
However, average PageRank is sensitive to outliers

Title Similarity Network and Community
Nodes = Papers
Edges = Similarity between papers
Text preprocessing
● Lower case
● Remove punctuation
● Remove stopwords
● Lemmatization
● Bag of words
● TFIDF
Pairwise Cosine Similarity
Output result
Adam: A Method for Stochastic Optimization
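The TFIDF and cosine similarity steps can be sketched without any external library. This is an illustration under assumptions: the titles are taken to be already lowercased and stripped of punctuation and stop words, lemmatization is omitted, and the smoothed IDF variant is a choice made here (the deck does not say which weighting was used). The helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """One sparse TF-IDF vector (dict term -> weight) per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Preprocessed toy titles; the second and third are made up for illustration.
titles = [
    "adam method stochastic optimization",
    "method stochastic optimization neural network",
    "citation network analysis",
]
vectors = tfidf_vectors([t.split() for t in titles])
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Titles that share no terms get a similarity of exactly 0, while overlapping titles fall between 0 and 1.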

Title Similarity Network and Community (2)
Nodes, Edges
Filter: cosine similarity >= 0.7

Title Similarity Network and Community (3)
182 communities,
but most of them are isolated communities
Filter: number of nodes in community >= 10
10 communities
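The thresholding and size filter can be sketched as below. Note the assumption: communities are taken here to be connected components of the thresholded similarity graph, which is a simplification and not the spectral clustering mentioned in the introduction. The function name and the toy scores are hypothetical, and the toy run uses a minimum size of 3 instead of 10.

```python
from collections import defaultdict


def communities_from_similarity(pair_scores, threshold=0.7, min_size=10):
    """Connected components of the thresholded similarity graph, filtered by size."""
    neighbors = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= threshold:          # keep only sufficiently similar pairs
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, communities = set(), []
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                     # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) >= min_size:   # drop small communities
            communities.append(component)
    return communities


pair_scores = {
    ("p1", "p2"): 0.9, ("p2", "p3"): 0.8,  # one group of three papers
    ("p4", "p5"): 0.75,                     # a small pair, filtered out by size
    ("p3", "p4"): 0.4,                      # below the threshold, no edge
}
result = communities_from_similarity(pair_scores, threshold=0.7, min_size=3)
print(result)
```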

Community Interpretation with LDA
Topic modeling by running an
LDA model on each
community
Grouping

Graph Clustering - End-to-end process

GraphSAGE
http://snap.stanford.edu/graphsage/
W.L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs”, 2017, arXiv:1706.02216 [cs.SI]

GraphSAGE Implementation
StellarGraph Machine Learning Library
https://stellargraph.readthedocs.io/

Unsupervised Sampler
Sampling: node pairs, positive and negative in equal numbers
Label: whether the node pair co-occurs in random walks of the graph
(Positive / Negative)
Train: node pair classifier
https://stellargraph.readthedocs.io/en/stable/demos/embeddings/graphsage-unsupervised-sampler-embeddings.html
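The sampling idea can be illustrated with a small pure-Python sketch. It is not StellarGraph's implementation: the function name and input shape are hypothetical, negatives are drawn uniformly at random (so a negative can occasionally coincide with a true neighbor), and the walk length is counted in nodes.

```python
import random


def sample_node_pairs(neighbors, walk_length=5, walks_per_node=1, seed=0):
    """Return (pairs, labels): pairs that co-occur in a random walk get label 1,
    and each is matched by one randomly drawn pair with label 0."""
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    pairs, labels = [], []
    for start in nodes:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length and neighbors[walk[-1]]:
                walk.append(rng.choice(sorted(neighbors[walk[-1]])))
            for context in walk[1:]:
                pairs.append((start, context))            # positive pair
                labels.append(1)
                pairs.append((start, rng.choice(nodes)))  # negative pair
                labels.append(0)
    return pairs, labels


# Toy undirected graph (hypothetical).
neighbors = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
pairs, labels = sample_node_pairs(neighbors)
print(len(pairs), sum(labels))
```

Because every positive pair is matched by one negative pair, the classifier sees balanced labels.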

Unsupervised GraphSAGE
Train: GraphSAGE encoder (graph structure + node features)
+ node pair classification (0/1)
Embedding model: graph structure + node features of all nodes
⟶ node embeddings (50 dimensions)

Model Training and Embedding
Training on Machine Learning
papers (40,635 nodes)
with a basic parameter setup
Epochs: 20
Elapsed time: 4-5 hours
Unfortunately, the loss did not budge.
There is a lot to improve, but we do not
have a proper environment at the moment.
Lesson learned: get a GPU!

Choosing K
K-Means vs Mini Batch K-Means
Clustering 40K embedded papers with 50 features each
Mini Batch K-Means: 0:00:58
K-Means: 0:12:38
D. Sculley, “Web-Scale K-Means Clustering”, Google, Inc., PA, USA, https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
To help select K with a scree plot, we can use
Mini Batch K-Means and a polynomial fit to approximate the SSE
over a given range of K. This is faster, as expected, and the
result appears close to that of full K-Means.
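The mini-batch update and the SSE used for a scree plot can be sketched in plain Python. This is a toy illustration under assumptions: it follows the per-center learning-rate update described in the reference above, initializes from distinct data points as a simplification, and leaves out the polynomial fit. All names and parameter values are hypothetical, and points must be tuples.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mini_batch_kmeans(points, k, batch_size=20, iterations=50, seed=0):
    """Mini-batch K-Means with a per-center learning rate of 1 / count."""
    rng = random.Random(seed)
    # Simplification: start from k distinct data points.
    centers = [list(p) for p in rng.sample(sorted(set(points)), k)]
    counts = [0] * k
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch to its nearest centers before updating.
        nearest = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: two tight groups, so the scree plot should drop sharply at K = 2.
points = [(0.0, 0.0)] * 50 + [(10.0, 10.0)] * 50
sse_by_k = {k: sse(points, mini_batch_kmeans(points, k)) for k in (1, 2)}
print(sse_by_k)
```

On real embeddings one would compute the SSE over a wider range of K and look for the elbow.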

Machine Learning
Papers
40,635 Papers
Node Features
- TFIDF from Title + Abstract
(top 2000 words)
# Random Walk: 1
Random Walk Length: 5
NN layer: [50,50]
Embedding: 50 dimensions
K-Means: 10 clusters
Bubble size = Page Rank

Machine Learning
Papers
Overlay with markers for the top 50
PageRank scores
The Adam optimizer paper, which has the
highest PageRank score, is located in
Cluster 7 together with several other
top-50 papers

Experiment with Node Features
Figure panels: BOW (two panels), TFIDF (two panels)

Social and
Information Network
Papers
Let's have a look at the papers in cs.SI,
which is directly related to this subject.
The paper with the 2nd highest PageRank score,
Graph Attention Networks, appears here;
we may want to explore the cluster that
contains it further

Conclusion
● Using Social Network Analysis can enrich the literature search
● One strength of graphs is that they are a “universal language”:
for the same data, we can generate different types of networks depending on
how we define the “relationships”
Future Work
● Incorporating more NLP techniques
● Model tuning, or using different models, e.g. Graph Attention Networks
● Navigating the citation network through a graphical,
interactive UI would be ideal for students looking for research topics and
conducting literature reviews

THANK YOU
