These are the presentation slides for the BigScholar 2019 workshop, held in conjunction with CIKM 2019 (ACM International Conference on Information and Knowledge Management) on Nov 7, 2019, at CNCC, Beijing, China.
Citation: Kurakawa K, Sun Y and Ando S (2020) Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence. Front. Big Data 2:48. doi: 10.3389/fdata.2019.00048
Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence
1. Application of
a Novel Subject Classification Scheme
for a Bibliographic Database
Using a Data-Driven Correspondence
Kei Kurakawa, Yuan Sun
National Institute of Informatics, Japan
Satoko Ando
Clarivate Analytics (Japan) Co., Ltd.
2. Overview
• Introduction
• Motivation
• Applying a new subject classification scheme for a subject-classified bibliographic database
• Our main contributions
• Related work
• Theoretical background
• Subject classification model of the bibliographic database based on set theory
• Main steps of our data-driven approach
• Case study
• Applying the Japanese grants KAKENHI subject classification scheme for the Web of
Science citation database
• Conclusions and future work
3. Motivation
• In assessing research activities based on bibliometrics, analysts are
accustomed to using the major citation database Web of Science, whose
subject classification schemes, i.e. WoS Subject Category, ESI, and
GIPP, are prepared for qualitative analysis.
• Analysts need domestic subject classification schemes for their
analysis, which are not implemented in the database.
• Applying a new classification scheme to the database by hand is too
labor-intensive and time-consuming.
• How can we apply a new classification scheme to the database
efficiently and effectively?
4. Our main contributions
• We propose an approach for applying a novel subject classification
scheme to a subject-classified database using a data-driven
correspondence between the new scheme and the existing one, a
setting familiar to digital libraries.
• We give a fundamental analytical model of subject classification
schemes based on set theory and describe the formation of a compact
topological space for a new subject classification scheme as a necessary
condition.
• We demonstrate the effectiveness and efficiency of our approach on a
practical bibliographic database.
5. Related work
• In the field of computer science,
• Information retrieval
• Data mining
• Digital libraries
• Automated text categorization
• Classification (supervised learning)
• Naïve Bayes classification
• Neural networks
• Support vector machines
• Clustering (unsupervised learning)
• K-means
• Expectation maximization (EM)
• Hierarchical agglomerative clustering
• Divisive clustering
• Matrix decompositions
• More problem-specific methods
• Multi-label classification / multi-label learning, based on
• SVM
• Deep learning
• Ensemble classification
• Extreme multi-label classification, based on
• Graph embedding
• Convolutional neural network (CNN)
• Attention model of neural networks
• Label hierarchy considered
• A method of mapping between different
classification schemes
• Importing cataloguing records using a
different classification scheme in digital
libraries
• Information integration on the Web
6. Theoretical background
• Subject classification model of the bibliographic database
• Compact topological space formation for a new subject classification
scheme
• Inducing a correspondence between two subject classification
schemes using a research project database
12. Given a finite cover
Compact topological space $(S, \mathfrak{O}^{(1)})$, where $\mathfrak{O}^{(1)} = \{O_i\}$ is a finite cover of the set $S$.
13. Another set of categories
Compact topological space $(S, \mathfrak{O}^{(2)})$, where $\mathfrak{O}^{(2)} = \{O_i\}$ is another set of categories on the set $S$.
14. If we have an external database such as …
• Research project database
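[Figure: articles and projects linked through a research project database; diagram labels $S'$, $S$, $T$, $b$, $O$, $h$]
Compact topological space $(T, \mathfrak{O}_T^{(2)})$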
15. If we have an external database such as …
• Research project database
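[Figure: articles and projects linked through a research project database; diagram labels $S'$, $S$, $T$, $b$, $O$, $h$]
Compact topological space $(T, \mathfrak{O}_T^{(2)})$
Compact topological space $(S', \mathfrak{O}_{S'}^{(2)})$
We can define a compact topological space for the second set of categories.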
16. Compact topological spaces for the two
subject classification schemes
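Compact topological spaces on $S'$:
• $(S', \mathfrak{O}_{S'}^{(1)})$, with $\mathfrak{O}^{(1)} = \{O_i^{(1)}\}$
• $(S', \mathfrak{O}_{S'}^{(2)})$, with $\mathfrak{O}^{(2)} = \{O_i^{(2)}\}$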
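A minimal sketch of what the finite-cover condition means in practice, assuming articles and categories are represented as plain Python sets and dictionaries. The function names and the toy identifiers (a1, W1, K1, and so on) are hypothetical and do not come from the slides. Given a correspondence from first-scheme categories to second-scheme categories, every article should end up in at least one second-scheme category; articles that do not are the ones a revision of the correspondence has to fix.

```python
def apply_correspondence(article_categories, correspondence):
    """Map each article's first-scheme categories to second-scheme categories.

    article_categories: dict article id -> set of first-scheme categories
    correspondence:     dict first-scheme category -> set of second-scheme categories
    """
    labels = {}
    for article, cats in article_categories.items():
        mapped = set()
        for cat in cats:
            mapped |= correspondence.get(cat, set())
        labels[article] = mapped
    return labels


def uncovered_articles(labels):
    """Articles that received no second-scheme category (the cover fails there)."""
    return {article for article, cats in labels.items() if not cats}


# Toy data (hypothetical): three articles with first-scheme categories.
article_wos = {"a1": {"W1"}, "a2": {"W2"}, "a3": {"W1", "W3"}}
correspondence = {"W1": {"K1"}, "W3": {"K2"}}

labels = apply_correspondence(article_wos, correspondence)
print(sorted(uncovered_articles(labels)))  # a2 has no second-scheme category

# Revising the correspondence so that W2 also maps somewhere restores the cover.
correspondence["W2"] = {"K2"}
labels = apply_correspondence(article_wos, correspondence)
print(sorted(uncovered_articles(labels)))
```

The check only tests that the second-scheme categories cover the article set; it says nothing about how good the correspondence is.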
21. Main steps of our approach
1. Inducing a correspondence between the two subject classification schemes by using the $F_\beta$-measure
1’-1. Constructing a contingency table between the two subject classification schemes
1’-2. Inducing a correspondence between the two subject classification schemes by using the pseudo $F_\beta$-measure
2. Revising the correspondence to guarantee the existence of a finite cover of the novel subject classification scheme
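The contingency-table and pseudo F-measure steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the pseudo precision and pseudo recall are computed here from cell counts of the contingency table (so an article with several categories is counted more than once), and the set of first-scheme categories is chosen by a simple greedy prefix search. All function names and the toy categories A, B, C, X, Y are hypothetical.

```python
from collections import Counter


def contingency_table(records):
    """Count co-occurrences of (first-scheme, second-scheme) categories.

    records: iterable of (wos_categories, kaken_categories), one per linked article.
    """
    table = Counter()
    for wos_cats, kaken_cats in records:
        for w in wos_cats:
            for k in kaken_cats:
                table[(w, k)] += 1
    return table


def pseudo_f_beta(table, selected_wos, kaken_cat, beta=1.0):
    """Return (pseudo precision, pseudo recall, pseudo F_beta) from cell counts."""
    hits = sum(n for (w, k), n in table.items()
               if k == kaken_cat and w in selected_wos)
    if hits == 0:
        return 0.0, 0.0, 0.0
    selected_total = sum(n for (w, _), n in table.items() if w in selected_wos)
    kaken_total = sum(n for (_, k), n in table.items() if k == kaken_cat)
    precision = hits / selected_total
    recall = hits / kaken_total
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def best_wos_set(table, kaken_cat, beta=1.0):
    """Assumed greedy search: add categories by descending count, keep the best prefix."""
    candidates = sorted({w for (w, k) in table if k == kaken_cat},
                        key=lambda w: (-table[(w, kaken_cat)], w))
    best_set, best_score = set(), (0.0, 0.0, 0.0)
    selected = set()
    for w in candidates:
        selected.add(w)
        score = pseudo_f_beta(table, selected, kaken_cat, beta)
        if score[2] > best_score[2]:
            best_set, best_score = set(selected), score
    return best_set, best_score


# Toy linked articles (hypothetical): (first-scheme categories, second-scheme categories)
records = [({"A"}, {"X"}), ({"A"}, {"X"}), ({"B"}, {"X"}),
           ({"B"}, {"Y"}), ({"C"}, {"Y"})]
table = contingency_table(records)
chosen, (p, r, f1) = best_wos_set(table, "X")
print(sorted(chosen), round(p, 3), round(r, 3), round(f1, 3))
```

An exhaustive search over subsets would be exact but grows exponentially with the number of candidate categories, which is why the sketch falls back on a greedy order.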
22. Case study
• InCites™ (Clarivate Analytics)
• A world class research evaluation platform
• Web of Science™ citation database
• Web of Science classification scheme (251 categories)
• Essential Science Indicators (ESI) classification scheme
(22 categories)
• Japanese users are eager to utilize the subject
classification scheme of Japan’s largest national
research grants KAKENHI.
• KAKEN (NII) research project database
• Archival records of research projects and the
outputs of KAKENHI grants in Japan.
• KAKENHI subject classification scheme (hierarchical
classification scheme; 4 categories, 10 areas, 67
disciplines, and 284 research fields)
25. Developing a contingency table as evidence
data
• We matched bibliographic records in the KAKEN database against the WoS citation database for 2009 and
2010 using a set of record linkage techniques, obtaining a set of articles $S'$ that are
classified under both the KAKENHI and WoS classification schemes.
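[Figure: bibliographic linkage between KAKEN (projects and articles $a_1'$, $a_2'$, $a_3'$) and Web of Science (articles $a_1$, $a_2$, $a_3$), with $a_1' \equiv a_1$, $a_2' \equiv a_2$, $a_3' \equiv a_3$; diagram labels $S'$, $S$, $T$, $b$, $O$, $h$]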
26. A contingency table between WoS and
KAKENHI subject classification schemes
[Figure: part of the contingency table, 251 WoS categories × 67 KAKENHI disciplines]
27. Analysis of the contingency table
The discrete generalized beta distribution (DGBD):
$f(r) = A \dfrac{(R + 1 - r)^b}{r^a}$,
where $r$ is the rank value, $R$ its maximum value, $A$ a normalization constant,
and $a$ and $b$ two fitting components.
31.
| Average cardinality | Average no. of WoS subject categories | Average pseudo precision | Average pseudo recall | Average pseudo F1 measure |
|---|---|---|---|---|
| 1450.4 | 6.1 | 0.315 | 0.367 | 0.317 |

The third-level 67 disciplines

| Seq. no. | KAKENHI subject category | Translation | Cardinality | No. of WoS subject categories to get the max pseudo F1 measure | Pseudo precision | Pseudo recall | Max pseudo F1 measure |
|---|---|---|---|---|---|---|---|
| (l3-46) | 材料工学 | Material Engineering | 2931 | 6 | 0.348 | 0.523 | 0.418 |
| (l3-47) | プロセス工学 | Process/Chemical Engineering | 1283 | 4 | 0.145 | 0.306 | 0.197 |
| (l3-48) | 総合工学 | Integrated Engineering | 1465 | 8 | 0.256 | 0.309 | 0.280 |
| (l3-49) | 基礎生物学 | Basic Biology | 2423 | 7 | 0.375 | 0.400 | 0.387 |
| (l3-50) | 生物科学 | Biological Science | 2679 | 4 | 0.167 | 0.582 | 0.259 |
| (l3-51) | 人類学 | Anthropology | 300 | 3 | 0.315 | 0.440 | 0.367 |
| (l3-52) | 農学 | Plant Production and Environmental Agriculture | 899 | 4 | 0.307 | 0.449 | 0.365 |
| (l3-53) | 農芸化学 | Agricultural Chemistry | 1755 | 6 | 0.220 | 0.386 | 0.281 |
| (l3-54) | 林学 | Forest and Forest Products Science | 559 | 5 | 0.408 | 0.252 | 0.312 |
| (l3-55) | 水産学 | Applied Aquatic Science | 581 | 2 | 0.419 | 0.327 | 0.367 |
| (l3-56) | 農業経済学 | Agricultural Science in Society and Economy | 31 | 2 | 0.333 | 0.097 | 0.150 |
| (l3-57) | 農業工学 | Agro-Engineering | 216 | 4 | 0.157 | 0.259 | 0.195 |
| (l3-58) | 畜産学・獣医学 | Animal Life Science | 1190 | 4 | 0.511 | 0.387 | 0.440 |
| (l3-59) | 境界農学 | Boundary Agriculture | 541 | 4 | 0.235 | 0.148 | 0.181 |
| (l3-60) | 薬学 | Pharmacy | 3457 | 4 | 0.294 | 0.369 | 0.328 |
| (l3-61) | 基礎医学 | Basic Medicine | 5232 | 16 | 0.213 | 0.551 | 0.307 |
| (l3-62) | 境界医学 | Boundary Medicine | 850 | 12 | 0.162 | 0.112 | 0.132 |
| (l3-63) | 社会医学 | Society Medicine | 1065 | 8 | 0.282 | 0.262 | 0.271 |
32. Miscellaneous considerations
• Decision by an expert
• Limit the number of correspondences to 1–4 for $O_i^{(1)}$.
• For every Web of Science subject category $O_i^{(1)}$, the number of relations with KAKENHI
subject categories $O_j^{(2)}$ is limited to at most 4.
• For every Web of Science subject category $O_i^{(1)}$, when the recall rate exceeds one half, we
stop adding relations.
• Check all correspondences between $O_i^{(1)}$ and $O_j^{(2)}$.
• Add or remove correspondence relations between them by means of subject
classification keywords.
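The two limiting rules above can be sketched as a small selection routine. This is one possible reading, not the procedure actually used: it assumes that "recall rate" for a Web of Science category means the share of that category's contingency-table row covered by the KAKENHI categories chosen so far, and that relations are added in descending order of count. The function name and parameters are hypothetical.

```python
def limit_relations(row_counts, max_relations=4, recall_stop=0.5):
    """Pick second-scheme categories for one first-scheme category.

    row_counts: dict second-scheme category -> co-occurrence count (one table row).
    Stops once max_relations are chosen or the covered share exceeds recall_stop.
    """
    total = sum(row_counts.values())
    chosen = []
    covered = 0
    # Descending count, ties broken by name so the result is deterministic.
    for kaken_cat, n in sorted(row_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(chosen) >= max_relations:
            break
        if total and covered / total > recall_stop:
            break
        chosen.append(kaken_cat)
        covered += n
    return chosen


print(limit_relations({"X": 6, "Y": 3, "Z": 1}))           # stops after X (share 0.6)
print(limit_relations({c: 1 for c in "abcdefghij"}))        # capped at four relations
```

The expert check and the keyword-based additions or removals listed above remain manual steps and are not modeled here.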
34. Example screen of InCites™
[Figure: InCites™ screens (a snapshot of 2018-12-14)]
• WoS Documents: 58,395,008 for Web of Science subject categories
• WoS Documents: 3,192,449 for Web of Science subject categories, limited with “LOCATION = JAPAN”
• WoS Documents: 3,191,448 for KAKEN L3 subject categories, limited with “LOCATION = JAPAN”
• The bubbles represent proportional numbers of articles classified using the KAKENHI subject categories.
35. Top 30 subject distribution of Japanese authors’
articles with the two subject classification
schemes
[Figure panels: WoS subject classification scheme; KAKENHI subject classification scheme]
36. User feedback: Questions and answers on the
validity of the KAKENHI subject classification
scheme
• KAKEN classification scheme
• April 2016, released on InCites Benchmarking
• User survey
• March 2017 by online questionnaire for
institutional active users
• 18 questions
• Results
• Feedback from 26 institutional users
• Q7
• Which levels of hierarchy in KAKENHI subject
classification scheme do you need?
• Q11
• Do you feel comfortable with your analysis
results by KAKENHI subject classification scheme
in accordance with your experience?
| User role in the institution | Yes (multiple answers possible) |
|---|---|
| RA (research administrator) | 20 |
| Administrator / officer | 3 |
| IR (institutional research) staff | 5 |
| Others | 2 |

Other: 1 (“I need more detailed categories”)
37. Discussion (1)
• Our approach, i.e. deciding a correspondence between two subject
classification schemes, has an inherent limitation.
• In natural correlations between subject categories of two subject classification
schemes, each subject category of one scheme partly overlaps several subject
categories of the other scheme.
• There is no inclusion relationship between them.
• Correspondence relations are probabilistic.
• Research projects and journal articles have similarities and differences in
subject.
• Projects and articles are strongly correlated in subject.
• However, they also differ in subject.
• Projects precede articles.
• Projects tend to indicate the central concept with essential keywords.
38. Discussion (2)
• Nevertheless, the classification results were accepted by InCites
users.
• Our approach requires less workload.
• The number of journal titles in the Web of Science citation database is 24,688.
• The number of Web of Science documents in InCites is 58,395,008.
• The number of subject category pairs to decide a correspondence is 16,817.
• For KAKEN 67 - WoS 251, the number of the pairs is 16,817.
• For KAKEN 10 - WoS 251, the number of the pairs is 2,510.
• However, the evidence data are not sufficient for automatic decision making.
• The sum of frequency counts of the contingency table is 97,175.
• Manual handling was needed.
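The pair counts above follow directly from the sizes of the two schemes. As a rough worked check (the average per pair is an added illustration, computed only from the figures on this slide):

```latex
\begin{align*}
67 \times 251 &= 16{,}817 && \text{(KAKEN 67 -- WoS 251 pairs)} \\
10 \times 251 &= 2{,}510 && \text{(KAKEN 10 -- WoS 251 pairs)} \\
\frac{97{,}175}{16{,}817} &\approx 5.78 && \text{(average frequency count per pair)}
\end{align*}
```

An average of fewer than six counts per category pair is consistent with the statement that the evidence alone does not support fully automatic decisions.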
39. Conclusions and future work
• Conclusions
• We proposed an approach for applying a new subject classification scheme to a bibliographic
database that is already classified under another subject classification scheme.
• We gave a fundamental analytical model of subject classification schemes based on set theory.
• Compact topological space formation for a new subject classification scheme is a necessary condition.
• An external database, e.g. a research project database, is used to induce a correspondence between the
two subject classification schemes.
• We applied the approach to a practical example: we added the subject classification scheme of
Japan’s largest national grants, KAKENHI, to InCites™, a research evaluation tool
based on the Web of Science citation database. The user survey indicates that users generally accept
the new function.
• Future work
• For a complex classification scheme such as a hierarchical classification scheme, our
approach should be extended to take the structure of the scheme into account.
• Alternatively, multi-label learning is another possible way to reach our goal. We need to
compare it with our method.
40. Acknowledgments
• This presentation is the result of joint research between the National
Institute of Informatics and Clarivate Analytics, Co., Ltd. As for the
databases we used in this presentation, the KAKEN database is
provided by the National Institute of Informatics, Cyber Science
Infrastructure Development Department, Scholarly and Academic
Information Division, and the Web of Science citation database is
provided by Clarivate Analytics, Co., Ltd. We are thankful to the
organizations that let us use these valuable assets.
Editor's Notes
intersection, set difference, union,
Another set of categories is unknown for the set S.
We want to specify the compact topological space for S.
As a data-driven approach, …
Given a research project database, we can observe compact topological spaces for the two subject classification schemes.
The strategy for inducing a correspondence is to maximize the F-measure.
The pseudo metrics are different from the original metrics because of subadditivity.