ArXiv Literature Exploration
using Social Network Analysis
Tanat Iempreedee (6210422036)
Yothin Kittithorn (6210422037)
Supalerk Pisitsupakarn (6210422040)
Ratchasit Ngamsa-ardwarit (6210422060)
Business Analytics and Data Science, Applied Statistics, NIDA
TABLE OF CONTENTS
01 INTRODUCTION
02 DATASET
03 ANALYSIS
04 CONCLUSION
INTRODUCTION
01
WHY WE SELECTED THIS PROJECT?
Pain Point
● Searching for research papers is not easy for those who are unfamiliar with a field.
● For a paper that we are studying, we may also want to check the papers that cite it or are cited by it
● We want to see similar or related papers even if we do not get the search keywords right
● Which papers should we prioritize first?
Intro
● Exploring the arXiv citation network using Social Network Analysis techniques
● PageRank as the indicator of paper importance
● Constructing a similarity network from title similarity, followed by spectral clustering
● Graph clustering using unsupervised GraphSAGE
DATASET
02
DATASET
ArXiv Dataset
Source : Kaggle
arXiv Dataset (version 4)
● Metadata (1.7+ Million papers, 4.5GB)
ID, Title, Abstract, Created Date, Category
Format: JSON
● Internal Citation (171 MB)
Citations between papers within arXiv only
Format: JSON
(the internal citation data is no longer available)
https://www.kaggle.com/Cornell-University/arxiv
C. B. Clement, M. Bierbaum, K. P. O'Keeffe and A. A. Alemi, “On the Use of ArXiv as a Dataset”, 2019, arXiv:1905.00075 [cs.IR].
Graph Representation
Type: Directed Graph
Node: Paper
Node Attributes: Metadata
Edge: [Paper 1] ⟶ [Cites] ⟶ [Paper 2]
Data Preparation
● Citations: remove self-loops, and remove citations to papers with no metadata available
● Drop isolated nodes (600K), since we want to study the network and these nodes distort averaged statistics such as average degree and average clustering coefficient
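The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name, the input layout (a dict of citing id to cited ids) and the toy data are assumptions, and the sketch also skips citing papers without metadata, which the slide does not state explicitly.

```python
def prepare_citation_edges(citations, metadata_ids):
    """Return (nodes, edges) after the cleaning steps described above.

    citations: dict mapping a citing paper id to a list of cited paper ids
    metadata_ids: set of paper ids for which metadata is available
    """
    edges = set()
    for src, targets in citations.items():
        if src not in metadata_ids:
            continue  # assumption: citing papers without metadata are skipped too
        for dst in targets:
            if dst == src:
                continue  # drop self-loops
            if dst not in metadata_ids:
                continue  # drop citations to papers without metadata
            edges.add((src, dst))
    # Keep only papers that take part in at least one citation (no isolates).
    nodes = {paper for edge in edges for paper in edge}
    return nodes, edges


# Hypothetical toy input, not taken from the dataset.
citations = {"A": ["A", "B", "X"], "B": ["C"], "D": []}
metadata_ids = {"A", "B", "C", "D"}
nodes, edges = prepare_citation_edges(citations, metadata_ids)
```

In the toy input, the self-loop on A and the citation to X (no metadata) are removed, and D is dropped because it ends up isolated.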
Text Preprocessing
● Title and Abstract: remove stop words and normalize text using lemmatization
ANALYSIS
03
Network Statistics
❏ # Nodes: 1,115,865
❏ # Edges: 7,833,188
❏ Density: 6.3e-6
❏ Avg. degree: 14.0397
❏ Avg. clustering coefficient: 0.0823
❏ Largest connected component: 1,005,136
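As a quick arithmetic check, the reported density and average degree are consistent with the standard formulas for a directed graph without self-loops, assuming that "Avg. degree" means in-degree plus out-degree. The variable names are illustrative.

```python
n_nodes = 1_115_865
n_edges = 7_833_188

# Density of a directed graph without self-loops: m / (n * (n - 1)).
density = n_edges / (n_nodes * (n_nodes - 1))

# Average total degree (in-degree + out-degree): every edge adds 2 to the sum.
avg_degree = 2 * n_edges / n_nodes

print(f"density    = {density:.1e}")
print(f"avg degree = {avg_degree:.4f}")
```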
Low Degree
Citations to papers outside arXiv are missing, and there are probably data issues as well. This suggests that our network does not fully capture the real citation network.
Low Density, Low Clustering Coefficient
If Paper A, created in 2017, is cited by Paper B, created in 2018, Paper A would not normally cite Paper B in return. Citations mostly point backward in time, so the number of edges is low compared to the number of possible edges in the graph.
Largest Connected Component
The largest weakly connected component (weak because this is a directed graph) covers 1,005,136 of the 1,115,865 nodes, about 90%. This means that knowledge across fields in arXiv is connected in some way.
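For reference, a weakly connected component ignores edge direction, so its size can be computed with a union-find structure. The following is a minimal sketch on a hypothetical toy graph; the function name is an assumption, and a graph library would normally be used for a network of this size.

```python
from collections import Counter


def largest_weak_component_size(nodes, edges):
    """Size of the largest weakly connected component (edge direction ignored)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Walk up to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = Counter(find(v) for v in nodes)
    return max(sizes.values()) if sizes else 0


# Hypothetical toy graph: {A, B, C} and {D, E} are the two weak components.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("C", "B"), ("D", "E")]
largest = largest_weak_component_size(nodes, edges)
```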
Network Properties (1.1 M papers)
The out-degree is generally lower than the in-degree
Log scale on the Y-axis
Temporal Network Statistics
The citation network and its statistics grow over time
*2020/Q2
By iteratively building a cumulative subgraph from the beginning up to a given point in time, we compute the network statistics for each year.
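The cumulative snapshots can be sketched as follows. This is an illustration under assumptions: the function name and input layout are hypothetical, an edge is counted from the year in which both of its papers exist, and only node and edge counts are tracked (the slides do not list which statistics were computed per year).

```python
def yearly_snapshots(paper_year, edges):
    """Return a list of (year, n_nodes, n_edges) for the cumulative graph.

    paper_year: dict mapping paper id to its creation year
    edges: iterable of (citing, cited) pairs
    """
    # An edge exists once both of its endpoints have been created.
    edge_year = sorted(max(paper_year[u], paper_year[v]) for u, v in edges)
    node_year = sorted(paper_year.values())
    snapshots = []
    i = j = 0
    for year in sorted(set(paper_year.values())):
        while i < len(node_year) and node_year[i] <= year:
            i += 1
        while j < len(edge_year) and edge_year[j] <= year:
            j += 1
        snapshots.append((year, i, j))
    return snapshots


# Hypothetical toy data.
paper_year = {"A": 2017, "B": 2018, "C": 2018, "D": 2019}
edges = [("B", "A"), ("C", "A"), ("D", "B")]
snapshots = yearly_snapshots(paper_year, edges)
```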
PageRank
● PageRank is used to rank websites in a web graph
● Since graphs are a universal language, the concept also applies to a citation network, which is likewise a directed graph
● PageRank can represent how important or popular a paper is
● Papers with a high PageRank score are generally cited a lot, and cited by other important papers
https://en.wikipedia.org/wiki/PageRank
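To make the idea concrete, here is a minimal power-iteration sketch of PageRank on a citation graph. It is not the implementation used for the slides; the damping factor of 0.85, the fixed iteration count and the uniform handling of papers that cite nothing are common defaults, assumed here.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; edges are (citing, cited) pairs."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Papers that cite nothing spread their score uniformly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: B, C and D all cite A; D also cites B.
nodes = ["A", "B", "C", "D"]
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(nodes, edges)
top_paper = max(scores, key=scores.get)
```

Paper A collects score from every other paper and ranks first; B ranks above C because it is cited once while C is not cited at all.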
Normalized PageRank
To compare PageRank across years, we use normalized PageRank to build the PageRank over Time statistics
K. Berberich, S. Bedathur, G. Weikum, “Normalized Page Rank for Evolving Graphs”, Max-Planck Institute for Informatics, Saarbrücken,
https://people.mpi-inf.mpg.de/~kberberi/presentations/2007-www2007.pdf
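The slides do not give the normalization formula. One reading of the cited work, stated here as an assumption, divides every score by the score that a paper with no incoming citations would receive, r_low = (ε + (1 - ε) · Σ r(d)) / |V|, where ε is the random-jump probability and the sum runs over dangling nodes d (papers that cite nothing). A paper with no citations then scores 1 regardless of graph size, which makes scores comparable across snapshots. The function name and the two-node example are illustrative.

```python
def normalize_pagerank(rank, dangling_nodes, damping=0.85):
    """Divide each score by the lower bound r_low described above.

    rank: dict of PageRank scores summing to 1, computed with dangling
          scores spread uniformly over all nodes
    dangling_nodes: nodes without outgoing edges
    """
    n = len(rank)
    epsilon = 1.0 - damping  # random-jump probability
    dangling_mass = sum(rank[v] for v in dangling_nodes)
    r_low = (epsilon + (1.0 - epsilon) * dangling_mass) / n
    return {v: score / r_low for v, score in rank.items()}


# Two-node example with a single citation A -> B (B is dangling).
# Solving the PageRank equations by hand with damping 0.85 gives r(A) = 0.5 / 1.425.
r_a = 0.5 / 1.425
rank = {"A": r_a, "B": 1.0 - r_a}
normalized = normalize_pagerank(rank, dangling_nodes=["B"])
```

In this example A has no incoming citations, so its normalized score is 1, while B scores 1.85.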
PageRank over Time
(All Papers)
For papers published in a given year, it takes at least 17 years for the average PageRank to exceed 3.5
Cohort Analysis
PageRank over Time
(cs.SI)
In the Social and Information Networks (cs.SI) field, the PageRank of papers published between 2014 and 2017 takes only 3 to 6 years to exceed 3.5.
This suggests that some papers become significantly more popular after publication.
● 2014 : CNN, RNN
● 2015 : CNN, NN
● 2016 : NN,
● 2017 : Adam, CNN, GAN
Top 5 PageRank over Time (All CS)
However, average PageRank is sensitive to outliers
Title Similarity Network and Community
Nodes = Papers
Edges = Similarity between papers
Text preprocessing
● Lower case
● Remove punctuation
● Remove stopwords
● Lemmatization
● Bag of words
● TFIDF
Pairwise Cosine Similarity
Output result
Adam: A Method for Stochastic Optimization
Title Similarity Network and Community (2)
Nodes, Edges
Filter: cosine similarity >= 0.7
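The construction of the title similarity network can be sketched with the standard library alone. This is an illustration, not the authors' pipeline: the function names and toy titles are hypothetical, the titles are assumed to be already preprocessed, and the smoothed IDF weighting is one common variant chosen here.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per tokenized title."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption); other weighting variants are equally valid.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                        for t, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(token_lists, threshold=0.7):
    """Connect every pair of titles whose cosine similarity meets the threshold."""
    vectors = tfidf_vectors(token_lists)
    return [(i, j) for i, j in combinations(range(len(vectors)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]


# Hypothetical, already preprocessed titles (lower-cased, stop words removed).
titles = [
    ["stochastic", "optimization", "method"],
    ["method", "stochastic", "optimization"],
    ["graph", "attention", "network"],
]
edges = similarity_edges(titles)
```

The all-pairs loop is quadratic in the number of titles, so a real run over many papers would need sparse matrix operations or blocking.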
Title Similarity Network and Community (3)
Filter: number of nodes in community >= 10
182 communities, but most of them are isolated communities
10 communities
Community Interpretation with LDA
Topic modeling by fitting an LDA model to each community in turn
Grouping
Graph Clustering - End-to-end process
GraphSAGE
http://snap.stanford.edu/graphsage/
W.L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs”, 2017, arXiv:1706.02216 [cs.SI]
GraphSAGE Implementation
StellarGraph Machine Learning Library
https://stellargraph.readthedocs.io/
Unsupervised Sampler
Sampling: positive and negative node pairs, in equal numbers
Label: whether the node pair co-occurs in random walks of the graph
Train: node pair classifier
https://stellargraph.readthedocs.io/en/stable/demos/embeddings/graphsage-unsupervised-sampler-embeddings.html
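The sampling idea can be illustrated without the library. The sketch below is not StellarGraph's implementation: the function name is hypothetical, and negatives are drawn uniformly from all nodes, which is a simplification assumed here.

```python
import random


def sample_node_pairs(adjacency, walk_length=5, number_of_walks=1, seed=0):
    """Return (pairs, labels) with equal numbers of positive and negative pairs."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    pairs, labels = [], []
    for start in nodes:
        for _ in range(number_of_walks):
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            for context in walk[1:]:
                # Positive pair: the two nodes co-occur in a random walk.
                pairs.append((start, context))
                labels.append(1)
                # Negative pair: the start node with a uniformly sampled node.
                pairs.append((start, rng.choice(nodes)))
                labels.append(0)
    return pairs, labels


# Hypothetical undirected toy graph as an adjacency list; D has no neighbours.
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
pairs, labels = sample_node_pairs(adjacency)
```

With one walk of length 5 per node, each of A, B and C yields four positive pairs and four negative pairs, while the isolated node D yields none.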
Unsupervised GraphSAGE
Train: GraphSAGE encoder (graph structure + node features) applied to each node of a pair, followed by node pair classification (0/1)
Embedding model: graph structure + node features of all nodes ⟶ node embeddings (50 dimensions)
Model Training and Embedding
Training on Machine Learning papers (40,635 nodes)
using a basic parameter setup
Epochs: 20
Elapsed time: 4-5 hours
Unfortunately, the loss did not budge.
There is a lot to improve, but we do not have a proper environment at the moment.
Lesson learned: get a GPU!
Choosing K
K-Means vs Mini Batch K-Means
Clustering the embeddings of 40K papers with 50 features each
Mini Batch K-Means: 0:00:58
K-Means: 0:12:38
D. Sculley, “Web-Scale K-Means Clustering”, Google, Inc., PA, USA, https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
To help select K from a scree plot, we can use MiniBatch K-Means with a polynomial fit to approximate the SSE over a given range of K. It is faster (as expected) and the result seems close.
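The mini-batch update from the cited paper, and the SSE values that a scree plot would show, can be sketched as follows. This is a stdlib-only illustration under assumptions: the function names, the explicit initial centers and the toy data are hypothetical, and the polynomial fit mentioned above is not included. In practice a library implementation such as scikit-learn's `MiniBatchKMeans` would be used.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))


def mini_batch_kmeans(points, init_centers, batch_size=10, iterations=50, seed=0):
    """Mini-batch k-means with a per-center learning rate of 1 / count."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        assigned = [nearest(x, centers) for x in batch]  # cache assignments first
        for x, c in zip(batch, assigned):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = [(1 - eta) * q + eta * p for p, q in zip(x, centers[c])]
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for x in points:
        c = centers[nearest(x, centers)]
        total += sum((p - q) ** 2 for p, q in zip(x, c))
    return total


# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
# Assumption: the first k points serve as initial centers.
sse_by_k = {k: sse(points, mini_batch_kmeans(points, points[:k])) for k in (1, 2, 3)}
```

Plotting `sse_by_k` against K gives the scree plot; on this toy data the SSE drops sharply from K = 1 to K = 2, which is the elbow.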
Machine Learning
Papers
40,635 Papers
Node Features
- TFIDF from Title + Abstract
(top 2000 words)
Number of random walks: 1
Random walk length: 5
NN layer sizes: [50, 50]
Embedding: 50 dimensions
K-Means: 10 clusters
Bubble size = Page Rank
Machine Learning
Papers
Overlay with markers for the top 50 PageRank scores
The Adam optimizer paper, which has the highest PageRank score, is located in Cluster 7 together with several other top-50 papers
Experiment with Node Features
BOW vs. TFIDF
Social and
Information Network
Papers
Let’s have a look at the papers in cs.SI, which is directly related to this subject
The paper with the 2nd highest PageRank score, Graph Attention Networks, is in that cluster; we may want to explore further what is inside it
CONCLUSION
04
Conclusion
● Social Network Analysis can enrich literature search
● One of the good traits of a graph is that it is a “universal language”: for the same data, we can generate different types of network depending on how we define the “relationships”
Future Work
● Incorporating more NLP techniques
● Model tuning, or using different models, e.g. Graph Attention Networks
● A graphical, interactive UI for navigating the citation network would be ideal for students looking for research topics and doing literature reviews
Slidesgo
Flaticon Freepik
Please keep this slide for attribution.
THANK YOU

More Related Content

What's hot

Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Neo4j
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
 
Keyword Query Routing
Keyword Query RoutingKeyword Query Routing
Keyword Query RoutingSWAMI06
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Graphs for Finance - AML with Neo4j Graph Data Science
Graphs for Finance - AML with Neo4j Graph Data Science Graphs for Finance - AML with Neo4j Graph Data Science
Graphs for Finance - AML with Neo4j Graph Data Science Neo4j
 
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationDigital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationJen Stirrup
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration James Hendler
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory ProjectScalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory Projectjoshpaulson
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?Samet KILICTAS
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationRich Heimann
 

What's hot (20)

Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine Learning
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
Keyword Query Routing
Keyword Query RoutingKeyword Query Routing
Keyword Query Routing
 
Keyword query routing
Keyword query routingKeyword query routing
Keyword query routing
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge Graphs
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Graphs for Finance - AML with Neo4j Graph Data Science
Graphs for Finance - AML with Neo4j Graph Data Science Graphs for Finance - AML with Neo4j Graph Data Science
Graphs for Finance - AML with Neo4j Graph Data Science
 
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationDigital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory ProjectScalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics Corporation
 

Similar to ArXiv Literature Exploration using Social Network Analysis

GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceNeo4j
 
Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAbhishek Mungoli
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereEugene Hanikblum
 
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...Benjamin Nussbaum
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep LearningAndre Freitas
 
Building AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsBuilding AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsAndre Freitas
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFMLconf
 
What Makes Graph Queries Difficult?
What Makes Graph Queries Difficult?What Makes Graph Queries Difficult?
What Makes Graph Queries Difficult?Gábor Szárnyas
 
Leveraging Graphs for Better AI
Leveraging Graphs for Better AILeveraging Graphs for Better AI
Leveraging Graphs for Better AINeo4j
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryNeo4j
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project ReportArnab Mukhopadhyay
 
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routingIEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routingIEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
2014 IEEE JAVA DATA MINING PROJECT Keyword query routingIEEEMEMTECHSTUDENTSPROJECTS
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...Tao Xie
 
3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine LearningNeo4j
 

Similar to ArXiv Literature Exploration using Social Network Analysis (20)

GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data Science
 
Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: Simrank
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
 
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...
Knowledge Graphs - Journey to the Connected Enterprise - Data Strategy and An...
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
Building AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsBuilding AI Applications using Knowledge Graphs
Building AI Applications using Knowledge Graphs
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
What Makes Graph Queries Difficult?
What Makes Graph Queries Difficult?What Makes Graph Queries Difficult?
What Makes Graph Queries Difficult?
 
Leveraging Graphs for Better AI
Leveraging Graphs for Better AILeveraging Graphs for Better AI
Leveraging Graphs for Better AI
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
GraphDB
GraphDBGraphDB
GraphDB
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project Report
 
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routingIEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
 
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
 
3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning
 

Recently uploaded

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Recently uploaded (20)

E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...

ArXiv Literature Exploration using Social Network Analysis

  • 1. ArXiv Literature Exploration using Social Network Analysis Tanat Iempreedee (6210422036) Yothin Kittithorn (6210422037) Supalerk Pisitsupakarn (6210422040) Ratchasit Ngamsa-ardwarit (6210422060) Business Analytics and Data Science, Applied Statistics, NIDA
  • 4. WHY WE SELECTED THIS PROJECT? Pain Point ● Searching for research papers is not easy for those who are unfamiliar with the field. ● For a paper we are studying, we may also want to check the papers that cite it or are cited by it. ● We want to see similar or related papers even if we do not get the search keywords right. ● Which paper should we prioritize first? Intro ● Exploring the arXiv Citation Network using Social Network Analysis techniques ● Page Rank as the paper importance indicator ● Constructing a Similarity Network from title similarity and proceeding with Spectral Clustering ● Graph clustering using unsupervised GraphSAGE
  • 6. DATASET ArXiv Dataset Source : Kaggle arXiv Dataset (version 4) ● Metadata (1.7+ Million papers, 4.5GB) ID, Title, Abstract, Created Date, Category Format: JSON ● Internal Citation (171 MB) Citation that occurred only in ArXiv Format: JSON (internal citation data is not available anymore) https://www.kaggle.com/Cornell-University/arxiv C. B. Clement, M. Bierbaum, K. P. O'Keeffe and A. A. Alemi, “On the Use of ArXiv as a Dataset”, 2019, arXiv:1905.00075 [cs.IR].
  • 7. Graph Representation Type: Directed Graph Node: Paper Node Attributes: Metadata Edge: [Paper 1] ⟶ [Cites] ⟶ [Paper 2]
  • 8. Data Preparation ● Citation - remove self-loops, and remove citations to papers with no metadata available ● Drop isolated nodes (600K), since we want to study the network and these isolated nodes distort averaged statistics such as avg. degree and avg. clustering coefficient Text Preprocessing ● Title and Abstract - remove stop words and normalize text using lemmatization
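The citation clean-up described on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `prepare_citation_edges`, the toy IDs, and the extra requirement that the citing paper also has metadata are assumptions made for the example.

```python
def prepare_citation_edges(edges, papers_with_metadata):
    """Return (clean_edges, connected_nodes) for a directed citation edge list.

    Hypothetical helper: edges are (citing_id, cited_id) pairs.
    """
    clean_edges = {
        (src, dst)
        for src, dst in edges
        if src != dst                      # drop self-loops
        and src in papers_with_metadata    # assumption: citing paper needs metadata too
        and dst in papers_with_metadata    # drop citations to papers without metadata
    }
    # Nodes that still take part in at least one citation.
    connected_nodes = {node for edge in clean_edges for node in edge}
    return clean_edges, connected_nodes


# Toy data: "X" has no metadata, ("B", "B") is a self-loop, ("A", "B") is duplicated.
metadata_ids = {"A", "B", "C", "D"}
raw_edges = [("A", "B"), ("B", "B"), ("C", "X"), ("A", "B")]

edges, nodes = prepare_citation_edges(raw_edges, metadata_ids)
isolated = metadata_ids - nodes  # these would be dropped before computing statistics
```

Using a set for the edges also removes duplicate citations, which is a convenience of this sketch rather than a step stated on the slide.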
  • 10. Network Statistics ❏ # Nodes: 1,115,865 ❏ # Edges: 7,833,188 ❏ Density: 6.3e-6 ❏ Avg. degree: 14.0397 ❏ Avg. clustering coefficient: 0.0823 ❏ Largest connected component: 1,005,136 Low Degree: Citations to papers that do not exist in arXiv are missing, and there are probably data issues as well. This tells us that our network does not fully capture the real nature of the Citation Network. Low Density, Low Clustering Coefficient: If Paper A, created in 2017, is cited by Paper B, created in 2018, Paper A would not cite Paper B. So the number of edges is low compared to the number of possible edges in the graph. Largest Connected Component: The biggest Weakly Connected Component (weakly, since this is a directed graph) is considerably large. This means that knowledge across fields in arXiv is connected in some way.
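As a quick arithmetic check, the reported figures are consistent with an average degree of 2E/N (in-degree plus out-degree per node) and a directed-graph density of E/(N(N-1)). The variable names below are illustrative only.

```python
# Figures taken from the Network Statistics slide.
n_nodes = 1_115_865
n_edges = 7_833_188
largest_wcc = 1_005_136

# Each edge contributes one out-degree and one in-degree.
avg_degree = 2 * n_edges / n_nodes

# A directed graph without self-loops has N * (N - 1) possible edges.
density = n_edges / (n_nodes * (n_nodes - 1))

# Share of papers inside the largest weakly connected component.
wcc_share = largest_wcc / n_nodes

print(f"avg degree: {avg_degree:.4f}")   # about 14.0397
print(f"density:    {density:.2e}")      # about 6.3e-06
print(f"WCC share:  {wcc_share:.1%}")    # about 90%
```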
  • 11. Network Properties (1.1 M papers) The out-degree is basically lower than the in-degree. Log scale on the Y-axis.
  • 12. Temporal Network Statistics The Citation Network grows over time, and so do its statistics. By iteratively creating an incremental subgraph from the beginning up to each point in time, we compute the network statistics yearly. *2020/Q2
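The incremental-subgraph idea can be sketched as follows. This is a simplified illustration under stated assumptions: the helper `yearly_snapshots` and the toy data are hypothetical, only node count, edge count and average degree are computed, and isolated nodes are not dropped here.

```python
def yearly_snapshots(paper_year, edges):
    """Yield (year, n_nodes, n_edges, avg_degree) for cumulative subgraphs.

    paper_year maps paper id -> creation year; edges are (citing, cited) pairs.
    """
    for year in sorted(set(paper_year.values())):
        # All papers created up to and including this year.
        nodes = {p for p, y in paper_year.items() if y <= year}
        # Keep a citation only if both endpoints already exist.
        sub_edges = [(s, d) for s, d in edges if s in nodes and d in nodes]
        avg_degree = 2 * len(sub_edges) / len(nodes)
        yield year, len(nodes), len(sub_edges), avg_degree


paper_year = {"A": 2017, "B": 2018, "C": 2018, "D": 2019}
edges = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]

snapshots = list(yearly_snapshots(paper_year, edges))
for row in snapshots:
    print(row)
```

Rebuilding each subgraph from scratch is quadratic over the years; for the full 1.1 M-node network one would more likely add each year's nodes and edges to the previous snapshot.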
  • 13. Page Rank ● Page Rank is used to determine the ranking of a website in a Web Graph ● Since a graph is a universal language, this concept can also be applied to a Citation Network, which is likewise a directed graph ● Page Rank can represent how important or popular papers are ● Papers with a high Page Rank score are generally cited a lot and are also cited by other important papers https://en.wikipedia.org/wiki/PageRank
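To make the idea concrete, here is a minimal power-iteration sketch of Page Rank on a toy citation graph. It is not the implementation used in the slides; the function name, the damping factor of 0.85, and the uniform redistribution of score from papers that cite nothing are assumptions of this example.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration Page Rank over (citing, cited) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Score held by papers that cite nothing is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each citing paper passes its score on to the papers it cites.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph: A is cited by B, C and D; B is cited by D.
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(citations)
top_paper = max(scores, key=scores.get)
print(top_paper, scores)
```

In this toy graph the most-cited paper A ranks first, and B outranks C because B itself receives a citation.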
  • 14. Normalized Page Rank In order to compare Page Rank across years, we use normalized Page Rank to create Page Rank over Time statistics K. Berberich, S. Bedathur, G. Weikum, “Normalized Page Rank for Evolving Graphs”, Max-Planck Institute for Informatics, Saarbrücken, https://people.mpi-inf.mpg.de/~kberberi/presentations/2007-www2007.pdf
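The slide does not spell out the normalization. As recalled from the cited Berberich et al. work, and to be treated as an assumption here rather than as the authors' exact procedure, each raw score r(v) is divided by the lowest score any node can receive in that graph, where ε is the random-jump probability, V the set of nodes, and D the set of dangling nodes (papers with no outgoing citations):

```latex
r_{\mathrm{low}} = \frac{1}{|V|}\Bigl(\epsilon + (1-\epsilon)\sum_{d \in D} r(d)\Bigr),
\qquad
\hat{r}(v) = \frac{r(v)}{r_{\mathrm{low}}}
```

Under this definition a paper with no incoming citations has a normalized score of 1, so a normalized score of 3.5 would mean 3.5 times the score of an uncited paper in the same yearly snapshot, which is what makes scores comparable across graphs of different sizes.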
  • 15. Page Rank over Time (All Papers) Cohort Analysis: for each publication year, it takes at least 17 years to reach an average PageRank greater than 3.5
  • 16. Page Rank over Time (cs.SI) In the Social and Information Networks (cs.SI) field, the PageRank of papers published between Y’14 and Y’17 takes only 3 - 6 years to rise above 3.5. This implies that some papers become significantly more popular after publication ● 2014 : CNN, RNN ● 2015 : CNN, NN ● 2016 : NN, ● 2017 : Adam, CNN, GAN New Old
  • 17. Top 5 Page Rank over Time (All CS) However, the average Page Rank is sensitive to outliers
  • 18. Title Similarity Network and Community Nodes = Papers Edges = Similarity between papers Text preprocessing ● Lower case ● Remove punctuation ● Remove stopwords ● Lemmatization ● Bag of words ● TFIDF Pairwise Cosine Similarity Output result Adam: A Method for Stochastic Optimization
  • 19. Title Similarity Network and Community (2) Nodes Edge Filter Cosine >= 0.7
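The pipeline from titles to similarity edges can be sketched with the standard library alone. This is an illustration under assumptions: tokenization is a plain lowercase split (stop-word removal and lemmatization are omitted), the smoothed IDF formula is one common choice rather than necessarily the one used in the slides, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_edges(titles, threshold=0.7):
    """Connect two titles when their cosine similarity reaches the threshold."""
    vectors = tfidf_vectors([title.lower().split() for title in titles])
    return [
        (i, j)
        for i, j in combinations(range(len(titles)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


titles = [
    "stochastic gradient optimization method",
    "method stochastic gradient optimization",
    "graph attention network",
]
edges = similarity_edges(titles)
print(edges)
```

The all-pairs loop is quadratic in the number of titles, so a real run over a large corpus would need sparse matrix products or blocking; the sketch only shows the logic of the 0.7 filter.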
  • 20. Title Similarity Network and Community (3) 182 communities, but most of them are isolated communities. Filter: No. of nodes in community >= 10, leaving 10 communities
  • 21. Community Interpretation with LDA Topic modeling by iterating an LDA model over each community grouping
  • 22. Graph Clustering - End-to-end process
  • 23. GraphSAGE http://snap.stanford.edu/graphsage/ W.L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs”, 2017, arXiv:1706.02216 [cs.SI]
  • 24. GraphSAGE Implementation StellarGraph Machine Learning Library https://stellargraph.readthedocs.io/
  • 25. Unsupervised Sampler Node pairs are sampled with Positive and Negative labels in equal numbers and used to train a Node Pair Classifier. Label: whether the node pair co-occurs in random walks of the graph https://stellargraph.readthedocs.io/en/stable/demos/embeddings/graphsage-unsupervised-sampler-embeddings.html
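The sampling scheme can be sketched in plain Python. This is a simplified stand-in, not StellarGraph's sampler: the function names are hypothetical, negatives are drawn uniformly at random (so a negative may occasionally be a true neighbor), and the defaults of one walk per node with walk length 5 mirror the settings listed on the Machine Learning Papers slide.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk up to `length` nodes, stopping early at a node with no neighbors."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def sample_node_pairs(adjacency, walks_per_node=1, walk_length=5, seed=0):
    """Return (target, context, label) triples with equally many 1s and 0s."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    pairs = []
    for start in nodes:
        for _ in range(walks_per_node):
            walk = random_walk(adjacency, start, walk_length, rng)
            for context in walk[1:]:
                pairs.append((start, context, 1))            # co-occurs in the walk
                pairs.append((start, rng.choice(nodes), 0))  # uniformly sampled negative
    return pairs


# Toy undirected graph as adjacency lists.
adjacency = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
pairs = sample_node_pairs(adjacency)
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(positives), len(negatives))
```

The resulting triples are what a node pair classifier would be trained on, with the encoder's embeddings as a by-product.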
  • 26. Unsupervised GraphSAGE Train: graph structure + node features go through the GraphSAGE Encoder into Node Pair Classification (0/1). Embedding Model: graph structure + node features of all nodes go through the trained encoder to produce Node Embeddings (50 dimensions)
  • 27. Model Training and Embedding Training on Machine Learning papers (40,635 nodes) with a basic parameter setup Epochs: 20 Elapsed Time: 4-5 hours Unfortunately, the loss doesn’t even budge. There are a lot of things to improve, but we do not have a proper environment at the moment. Lesson learned: get a GPU!
  • 28. Choosing K K-Means vs Mini Batch K-Means Clustering 40K embedded papers with 50 features each Mini Batch K-Means: 0:00:58 K-Means: 0:12:38 D. Sculley, “Web-Scale K-Means Clustering”, Google, Inc., PA, USA, https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf To help select K using a scree plot, we can use Mini Batch K-Means and a polynomial fit to approximate the SSE within a given K range. It turns out to be faster (obviously) and the result seems close.
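A rough sketch of this K-selection step, assuming scikit-learn and NumPy are available. The function name, the cubic polynomial degree, the K range, and the synthetic 50-dimensional data standing in for the paper embeddings are all assumptions of the example, not values from the slides.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def approximate_sse_curve(embeddings, k_values, degree=3, seed=0):
    """Fit Mini Batch K-Means per K, then smooth the SSE curve with a polynomial."""
    sse = []
    for k in k_values:
        model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed)
        model.fit(embeddings)
        sse.append(model.inertia_)  # sum of squared distances to the nearest centroid
    coefficients = np.polyfit(k_values, sse, deg=degree)
    smoothed = np.polyval(coefficients, k_values)
    return np.array(sse), smoothed


# Synthetic stand-in for node embeddings: 3 well-separated blobs in 50 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 50))
embeddings = np.vstack([c + rng.normal(size=(100, 50)) for c in centers])

k_values = list(range(1, 9))
sse, smoothed = approximate_sse_curve(embeddings, k_values)
for k, raw, fit in zip(k_values, sse, smoothed):
    print(k, round(float(raw), 1), round(float(fit), 1))
```

Plotting `sse` or `smoothed` against `k_values` gives the scree plot; the bend in the curve is the candidate K.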
  • 29. Machine Learning Papers 40,635 Papers Node Features - TFIDF from Title + Abstract (top 2000 words) # Random Walk: 1 Random Walk Length: 5 NN layer: [50,50] Embedding: 50 dimensions K-Means: 10 clusters Bubble size = Page Rank
  • 30. Machine Learning Papers Overlay with markers for the Top 50 Page Rank scores The Adam optimizer paper, which has the highest Page Rank score, is located in Cluster 7 together with several other Top 50 papers
  • 31. Experiment with Node Features (figure panels: BOW, TFIDF)
  • 32. Social and Information Network Papers Let’s have a look at the papers in cs.SI, which is directly related to this subject. The paper with the 2nd highest Page Rank score, Graph Attention Networks, is located there; we may want to explore what’s inside that cluster further
  • 34. Conclusion ● Using Social Network Analysis can enrich the literature search ● One of the good traits of a graph is that it is a “Universal Language”: for the same data, we can generate different types of network depending on how we define the “relationships” Future Work ● Incorporating more NLP techniques ● Model tuning, or using different models, e.g. Graph Attention Networks ● Navigating through the Citation Network using a graphical and interactive UI would be ideal for students looking for research topics and literature reviews
  • 35. Slidesgo Flaticon Freepik Please keep this slide for attribution. THANK YOU