This document discusses improving the categorization of scientific articles for an expert search system. It proposes training a category model using labeled training texts from a related domain, rather than requiring labeled scientific articles. Features are extracted from the training texts using TF-IDF and n-grams. The category model is tested on scientific articles by calculating cosine similarity between article and category feature vectors. An evaluation compares automated versus manual training text selection across common categories and common/specific categories, finding that manual selection achieves higher accuracy averages across five expert evaluators. The approach shows potential but challenges include selecting representative training texts and ensuring category coverage of the domain.
Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System
1. Category & Training Texts Selection for
Scientific Article Categorization in
an Expert Search System
By
Gan Keng Hoon*, Chua San Thai,
Khoh Zhuo Yan, Goh Kau Yang
School of Computer Sciences,
Universiti Sains Malaysia
2. Motivation
Scientific articles are produced as results of research.
Organizing scientific articles into subject areas or topics
helps in discovery, navigation, etc.
6. Scope
Application oriented research
Expert Search System
DBLP Dataset
School of Computer Sciences, USM
Goal
Improving the categorization of scientific articles
For
Capturing experts’ expertise based on their publications.
Enable category filtering during search.
7. Existing Approaches
Labelled Scientific Article
Supervised Learning method to train and test
Feature Selection
Bag of Words, N-gram, POS, Term Frequency, TF-IDF
This research
Train with Labelled Scientific Related Domain Texts
Test with Scientific Article
8. Research Justification
Avoid the use of a large number of labelled training texts.
Focus on differentiating good sources of training texts.
Use a reasonably small number of training texts to build the
subject category model.
10. Feature Selection
Feature Term Generation
N-gram technique is used to generate potential term candidates from the training text. E.g.
D = “Search engine is an artificial intelligence system.”
2-gram words: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] =>
artificial intelligence [5] => intelligence system)
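The 2-gram generation above can be sketched in a few lines of Python; this is a minimal illustration, not the authors' implementation, and the function name `word_ngrams` is a hypothetical helper:

```python
def word_ngrams(text, n=2):
    """Generate word-level n-grams (term candidates) from a sentence."""
    words = text.rstrip(".").split()  # drop trailing period, split on whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

doc = "Search engine is an artificial intelligence system."
print(word_ngrams(doc, n=2))
# → ['Search engine', 'engine is', 'is an', 'an artificial',
#    'artificial intelligence', 'intelligence system']
```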
Features Selection by TF-IDF
Term Frequency–Inverse Document Frequency (TF-IDF) is a common method for keyword
weighting: the TF-IDF value of each term is computed, and the terms with the top N values are
selected as features. This method penalizes a term when it occurs across many different training
texts. The TF-IDF values are computed as

TF-IDF(term_i) = TF(term_i) × log(N_D / DF(term_i))

where TF(term_i) is the frequency of term_i, DF(term_i) is the number of documents containing
term_i, and N_D is the total number of documents.
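The feature-selection step can be sketched as follows; this is a minimal illustration of the formula above under the assumption that each training text is already tokenized into terms (e.g. by the n-gram step), not the authors' code:

```python
import math

def tfidf_features(corpus, top_n=5):
    """Score each term of each training text by TF * log(N_D / DF),
    then keep the top-N terms as that text's features."""
    n_docs = len(corpus)
    # Document frequency: number of training texts containing each term.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    features = []
    for doc in corpus:
        # TF is the raw count of the term in this document.
        scores = {t: doc.count(t) * math.log(n_docs / df[t]) for t in set(doc)}
        features.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return features
```

Note that a term occurring in every training text gets log(N_D / N_D) = 0, i.e. it is fully penalized, which is the behaviour the slide describes.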
11. Transfer Training Approach
Intuition
If the training texts are representative enough to cover the concept of a
category, then the training sets can be obtained from any source that shares
similar concepts or semantics.
Criteria
The two text sources share the same, or partially similar, categories.
The categories must bear the same concept or meaning.
The training source must be comprehensive enough to cover a category’s concept.
The training source must be available even when the testing source is not.
This approach is particularly useful when the resources of unseen texts are not
readily available.
12. Training and Testing Category Model
The training of the category model, CM, can be defined using the CM_Build function. For each category,
Cat, the function takes in a set of documents, D_Cat, i.e. the training texts, and maps them to a set of
features, F_Cat.

CM_Build: D_Cat → F_Cat

The testing of the category model is defined using the CM_Sim function. For each new document, D_new,
the function maps the document to the set of most relevant categories, Cat.

CM_Sim: D_new → Cat

Feature Similarity Scoring
The scoring technique is based on the Vector Space Model cosine similarity measure. The feature
sets of the category model are viewed as vectors in a vector space, where each term has its own
axis. The similarity of a category and a document, Sim_F, is calculated by comparing the deviation
angle between the vectors as follows:

Sim_F = (F_Cat · F_D_new) / (|F_Cat| × |F_D_new|)

where F_Cat is the feature vector of a category and F_D_new is the feature vector of a new document.
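The cosine scoring above can be sketched directly from the formula; this is an illustrative sketch (feature vectors represented as term → weight dictionaries, function names are hypothetical), not the system's actual code:

```python
import math

def cosine_sim(f_cat, f_doc):
    """Sim_F = (F_Cat . F_Dnew) / (|F_Cat| * |F_Dnew|)."""
    dot = sum(w * f_doc.get(term, 0.0) for term, w in f_cat.items())
    norm_cat = math.sqrt(sum(w * w for w in f_cat.values()))
    norm_doc = math.sqrt(sum(w * w for w in f_doc.values()))
    if norm_cat == 0.0 or norm_doc == 0.0:
        return 0.0  # no features on one side => no similarity
    return dot / (norm_cat * norm_doc)

def best_category(category_models, f_doc):
    """CM_Sim: assign a new article to its highest-scoring category."""
    return max(category_models, key=lambda c: cosine_sim(category_models[c], f_doc))
```

A quick usage check: an article sharing the term "search" with an "information retrieval" category scores 0.5 against it and 0.0 against an unrelated category, so it is assigned to the former.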
13. Evaluation Settings
Performance Metric
Whether a scientific article is correctly assigned to a category.
Expert judgement is used to evaluate.
Training Texts
Title and Abstract are used.
Tasks
Common categories (30 general) vs. Common + Specific categories (30
general + 12 domain-specific)
Automated vs. Manual selection of training texts
14. Evaluation Results
            Common categories      Common + specific       Common + specific
            + Automated            categories + Automated  categories + Manual
            training texts (%)     training texts (%)      training texts (%)
Expert 1    62.50                  68.75                   81.25
Expert 2    46.67                  46.67                   53.33
Expert 3    33.33                  33.33                   66.67
Expert 4    33.33                  41.67                   41.67
Expert 5    43.75                  37.50                   28.13
Average     43.92                  45.59                   54.21
15. Conclusion
Possibility
A category model can be trained using training texts from one source and
applied to a different source.
Challenge
Selection of training texts, as they influence the accuracy of the trained
model.
Limitation
Selection of categories: the selected set is too small to cover the
domain’s (e.g. Computer Science) research areas.
16. Thank You
For more of our work, please visit ir.cs.usm.my
Email me at khgan@usm.my