Learning scientific scholar representations using a combination of collaboration, citation graph and text data
1. Learning Scientific Scholar Representations using a
Combination of Collaboration and Text Data
Ankush Khandelwal
Raksha Jalan
Bhavitha K
( IRE GROUP : 40 )
Information Retrieval Extraction
IIIT-H Spring 2016
2. Problem :
❏ The aim of the project is to learn vector representations for
authors who publish scientific research papers.
❏ The representations should be such that authors who work in the
same domain (i.e., the same research area) are closer in vector
space.
❏ These representations help to categorize or cluster authors into
various categories and to predict future collaborations based on
past data.
3. Introduction :
❏ Representation learning (feature learning), the transformation of
raw input data into a useful representation, is performed to learn
good vector representations for authors.
❏ Such techniques have achieved great success in applications like
image processing, speech recognition and natural language processing (NLP).
❏ The advantage is that once the vector representations are formed,
difficult network-mining tasks can be solved with standard
machine learning techniques.
4. Dataset :
❏ The DBLP computer science bibliography contains the metadata
of publications, written by several authors in thousands of
journals or conference proceedings series.
❏ We have used a subset of the dataset with metadata for
around 275,000 papers.
5. Text-processing :
Parse the dataset file to get a list of unique authors and
assign each author an id.
A snapshot of the auth id file:
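The parsing step above can be sketched as follows; the `papers` list is a hypothetical stand-in for the author lists parsed out of the DBLP file:

```python
# Hypothetical sketch: build an author -> id mapping from parsed paper records.
papers = [
    ["A. Smith"],
    ["B. Jones", "C. Lee"],
    ["A. Smith", "C. Lee"],
]

author_id = {}
for authors in papers:
    for name in authors:
        if name not in author_id:
            author_id[name] = len(author_id) + 1  # ids start at 1

for name, aid in author_id.items():
    print(aid, name)
```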
6. Co-authorship Information:
Each line in the given snapshot corresponds to a
paper.
The first line signifies that the author with id 1
wrote the first paper.
The second line implies that authors with ids 2
and 3 collaborated on the second paper, and so
on.
The author-name-to-id mapping is taken from the
authid file mentioned in the previous slide.
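A minimal sketch of reading such a co-authorship file, assuming each line holds the author ids of one paper; every 2-combination of a line is a co-author pair:

```python
from itertools import combinations

# Stand-in for the file contents: one line of author ids per paper.
lines = ["1", "2 3", "2 4 5"]

pairs = []
for line in lines:
    ids = [int(tok) for tok in line.split()]
    pairs.extend(combinations(ids, 2))  # all co-author pairs for this paper

print(pairs)  # [(2, 3), (2, 4), (2, 5), (4, 5)]
```

Single-author lines (like the first) contribute no pairs, which is the sparsity issue noted in the challenges slide.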
9. Training Neural Network :
I/p file : The input to the neural network is the refined co-authorship file, which contains
authors in positive and negative context w.r.t. every author.
10. Neural Network continued ..
➢ We have used Torch for training the neural network.
➢ The network is fed the positive and negative samples and trained for 10 epochs
over the authors in the dataset, learning a vector representation for each
author.
➢ The representations are trained so that authors in positive context end up closer
in vector space.
Vector representation sample (word-embedding size=30)
1:0.12774897468519,-1.2134315799647,0.28491147244956,0.8021796034968,0.24783552528964,0.064771391008334,-0.62943657350973,
-1.5811627032589,0.50791467408229,-0.016128751957846,-0.95420926437372,0.3088518152673,
-0.18527131689276,0.95070454842939,0.60509919040003,1.3706830088368,0.59082443074081,-2.3339685239631,
-2.5307487148746,0.2078369289687,0.32913756016955,1.6364679430803,0.65293421732019,-0.66457122621034,0.28869327954787,0.64982010840204,1.8983918247831,
-0.52790655050569,0.12223315845681,0.63230901357502
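A PyTorch analogue of this training loop (the project used Lua Torch; sizes and the toy batch below are illustrative): each author gets a 30-dim embedding, and a dot-product score is pushed high for positive pairs and low for negative pairs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_authors, dim = 100, 30           # 30-dim embeddings, as in the sample above
emb = nn.Embedding(n_authors, dim)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

# Toy batch of (author, context, label): label 1 = positive, 0 = negative.
authors = torch.tensor([1, 1, 2])
contexts = torch.tensor([2, 7, 9])
labels = torch.tensor([1.0, 0.0, 0.0])

for epoch in range(10):  # 10 epochs, as in the slides
    scores = (emb(authors) * emb(contexts)).sum(dim=1)  # dot-product score
    loss = loss_fn(scores, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(emb.weight[1][:5])  # first 5 dims of author 1's learned vector
```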
11. Classifying the vector Representations
The following classification techniques are used to classify the authors'
vector representations:
A. Stochastic Gradient Descent.
B. Support Vector Machines : RBF kernel is used and grid search is
performed.
C. Random Forest Classification.
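The three classifiers above can be sketched with scikit-learn; the 30-dim author vectors and category labels here are random stand-ins for the real data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))    # stand-in for learned author embeddings
y = rng.integers(0, 4, size=200)  # stand-in for research-area labels

sgd = SGDClassifier().fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# RBF-kernel SVM with a small grid search over C and gamma.
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10], "gamma": ["scale", 0.1]})
grid.fit(X, y)
print(grid.best_params_)
```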
13. CONCLUSION
A mean accuracy of 28 percent was observed using random forest,
compared to 30 percent with SVM.
The full text of the papers could be used so that positive-context
authors are placed closer based on the semantic content of the papers
they worked on.
For negative-context author selection, considering authors 1 or more
degrees away might also improve accuracy.
14. CHALLENGES :
Authors are sparsely distributed:
Many papers contained a single author, and the information for
authors who did not collaborate with anyone was ignored when
feeding the input to the neural network.