These are the presentation slides for the workshop BigScholar 2019, held in conjunction with CIKM 2019 (ACM International Conference on Information and Knowledge Management), Nov 7, 2019, at CNCC, Beijing, China.
Citation: Kurakawa K, Sun Y and Ando S (2020) Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence. Front. Big Data 2:48. doi: 10.3389/fdata.2019.00048
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval - IJECEIAES
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering of high-dimensional data is a very challenging process due to the curse of dimensionality, and existing methods leave space complexity and data retrieval performance unimproved. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high-dimensional data points for effective data retrieval based on user queries. A Normalized Spectral Clustering Algorithm is used to group similar high-dimensional data points. After that, a Vantage Point Tree is constructed to index the clustered data points with minimum space complexity. Finally, the indexed data is retrieved in response to user queries using a Vantage Point Tree based Data Retrieval Algorithm, which helps to improve the true positive rate with minimum retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather datasets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared to state-of-the-art works.
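The abstract above comes without code; as a rough, hypothetical sketch of the vantage-point-tree indexing it builds on (the function names and the recursive dict representation are my own, not the authors'):

```python
def build_vp_tree(points, dist):
    """Recursively build a vantage-point tree from a list of points.

    Each node stores a vantage point, the median distance mu to the
    remaining points, and "inside" (d <= mu) / "outside" subtrees.
    """
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return {"vp": vp, "mu": 0.0, "inside": None, "outside": None}
    dists = sorted(dist(vp, p) for p in rest)
    mu = dists[len(dists) // 2]
    return {"vp": vp, "mu": mu,
            "inside": build_vp_tree([p for p in rest if dist(vp, p) <= mu], dist),
            "outside": build_vp_tree([p for p in rest if dist(vp, p) > mu], dist)}

def nearest(node, q, dist, best=None):
    """Return the stored point closest to query q, pruning subtrees
    that cannot contain a better candidate (triangle inequality)."""
    if node is None:
        return best
    d = dist(q, node["vp"])
    if best is None or d < dist(q, best):
        best = node["vp"]
    tau = dist(q, best)  # current search radius
    # Descend the likelier side first; visit the other side only if
    # it could still hold a point within radius tau of the query.
    if d <= node["mu"]:
        best = nearest(node["inside"], q, dist, best)
        if d + tau > node["mu"]:
            best = nearest(node["outside"], q, dist, best)
    else:
        best = nearest(node["outside"], q, dist, best)
        if d - tau <= node["mu"]:
            best = nearest(node["inside"], q, dist, best)
    return best
```

A query descends the likelier half first and prunes with the triangle inequality, which is what keeps retrieval time low once similar points have been clustered together.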
K-Means, its Variants and its Applications - Varad Meru
This presentation was given by our project group at the Lead College competition at Shivaji University, where it won first prize. We focused mainly on Rough K-Means and built a Social-Network Recommender System based on Rough K-Means.
The members of the project group were:
Mansi Kulkarni,
Nikhil Ingole,
Prasad Mohite,
Varad Meru, and
Vishal Bhavsar.
Wonderful Experience !!!
System for Prediction of Non Stationary Time Series based on the Wavelet Radi... - IJECEIAES
This paper proposes and examines the performance of a hybrid model called the wavelet radial basis function neural network (WRBFNN). Its performance is compared with the wavelet feedforward neural network (WFFNN) model by developing a prediction or forecasting system that considers two input formats, input9 and input17, and four types of non-stationary time series data. The MODWT transform is used to generate wavelet and smooth coefficients, several elements of which are chosen in a particular way to serve as inputs to both the RBFNN and FFNN models. The performance of the WRBFNN and WFFNN models is evaluated using MAPE and MSE indicators, while their computation processes are compared using two indicators: number of epochs and length of training. On stationary benchmark data, all models perform with very high accuracy. The WRBFNN9 model is the most accurate on non-stationary data containing linear trend elements, while the WFFNN17 model performs best on non-stationary data with non-linear trend and seasonal elements. In terms of computational speed, the WRBFNN model is superior, with far fewer epochs and much shorter training time.
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ... - IJECEIAES
A hard partition clustering algorithm assigns equally distant points to exactly one cluster, even though such points could plausibly be assigned to several clusters simultaneously. Fuzzy cluster analysis instead assigns membership coefficients to data points that are equidistant between two clusters, so a data point can belong to more than one cluster at the same time. For a subset of the CiteScore dataset, the fuzzy clustering (fanny) and fuzzy c-means (fcm) algorithms were implemented to study data points that lie equally distant from each other. Before analysis, the clusterability of the dataset was evaluated with the Hopkins statistic, which resulted in 0.4371; a value below 0.5 was taken to indicate that the data is highly clusterable. The optimal number of clusters was determined using the NbClust package, where nine of the indices proposed a 3-cluster solution as the best. Further, an appropriate value of the fuzziness parameter m was evaluated to determine the distribution of membership values as m varies from 1 to 2. The coefficient of variation (CV), also known as relative variability, was evaluated to study the spread of the data. The time complexity of the fanny and fcm algorithms was evaluated by keeping the number of data points constant and varying the number of clusters.
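For readers unfamiliar with how fuzzy c-means produces the graded memberships discussed above, here is a minimal sketch of the standard membership update (a textbook formula, not the paper's exact implementation; the names are illustrative):

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership matrix: u[i][j] is the degree to which
    point i belongs to cluster j, softened by the fuzzifier m > 1."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    exp = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [d(p, c) for c in centers]
        if 0.0 in dists:  # point coincides with a center: crisp membership
            j = dists.index(0.0)
            u.append([1.0 if k == j else 0.0 for k in range(len(centers))])
            continue
        # u_ij = 1 / sum_k (d_ij / d_kj) ** (2 / (m - 1))
        u.append([1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists])
    return u
```

A point exactly midway between two centers gets membership 0.5 in each, which is precisely the equidistant case the abstract studies; as m approaches 1 the memberships harden toward a crisp partition.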
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... - PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Due to the continuous growth of Internet technology, security mechanisms need to be established. The Intrusion Detection System (IDS) is increasingly becoming a crucial component of computer and network security systems. Most existing intrusion detection techniques emphasize building an intrusion detection model based on all the features provided, some of which are irrelevant or redundant. This paper proposes identifying the important input features for building an IDS that is computationally efficient and effective. We identify important attributes for each attack type by analyzing the detection rate, and input the attack-specific attributes to Naive Bayes and Random Forest classifiers. We perform our experiments on the NSL-KDD intrusion detection dataset, which consists of selected records from the complete KDD Cup 1999 intrusion detection dataset.
These are the presentation slides for the joint conference of the 134th SIG conference on Information Fundamentals and Access Technologies (IFAT) and the 112th SIG conference on Document Communication (DC), Information Processing Society of Japan (IPSJ), March 22, 2019, at Toyo University, Hakusan Campus.
Cite: Kei Kurakawa, Yuan Sun, and Satoko Ando, Applying a new subject classification scheme for a database by a data-driven correspondence, IPSJ SIG Technical Report, Vol.2019-IFAT-134/2019-DC-112, No.7, pp.1-10, (2019).
MediaEval 2015 - UNED-UV @ Retrieving Diverse Social Images Task - Poster - multimediaeval
This paper details the participation of the UNED-UV group at the 2015 Retrieving Diverse Social Images Task. This year, our proposal is based on a multi-modal approach that firstly applies a textual algorithm based on Formal Concept Analysis (FCA) and Hierarchical Agglomerative Clustering (HAC) to detect the latent topics addressed by the images to diversify them according to these topics. Secondly, a Local Logistic Regression model, which uses the low level features and some relevant and non-relevant samples, is adjusted and estimates the relevance probability for all the images in the database.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly... - Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
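As a toy illustration of the syntactic matching step such a classifier performs (the real CSO Classifier also uses word embeddings and the ontology's hierarchy; this simplified matcher is my own sketch, not the published tool):

```python
def extract_topics(text, ontology_labels, max_n=3):
    """Return ontology labels that appear verbatim as n-grams
    (up to max_n words) in a paper's title/abstract text."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    labels = {label.lower() for label in ontology_labels}
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in labels:
                found.add(gram)
    return sorted(found)
```

Given the metadata text and a list of concept labels, the matcher returns every label mentioned in the text; the published system then enriches these matches with semantically similar concepts and their ancestors in CSO.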
As we may link: a model to support aggregated scientific knowledge - Prashant Gupta
Today, researchers are bogged down by a continually growing amount of complex and diverse scientific knowledge, fragmented and dispersed among various disciplines, communities, and information resources. Contemporary digital tools are efficient in dealing with the complexity and diversity of scientific knowledge and the process of science, but they have compartmentalized scientific knowledge among disparate and disconnected systems. For example, databases are used to structure data to facilitate easy retrieval; workflows represent the process of experiments; analytical tools support data analysis; and visualization tools visualize data and results for better understanding. However, these tools rarely connect or join together to synthesize an integrated view. Our digital knowledge ecosystem is siloed and poses a challenge for researchers seeking to search, comprehend, and reproduce scientific experiments.
Vannevar Bush, in his article 'As We May Think', discussed the data and information deluge and the challenge brought by the fragmentary nature of scientific knowledge. He proposed an imaginary machine, the Memex, that could tie knowledge records into a mesh of associative trails, which could be reviewed and consulted as a form of graph search. This talk will discuss a model that adopts Bush's associationist view to integrate scientific knowledge. Categories are commonly used in databases (in the form of logical schemas) and ontologies (as concepts and properties), but often these artifacts are disconnected from each other. The proposed model connects categories, along with their process of construction and evolution, with a database and ontology via tools that support their evolution. Explicitly connecting these knowledge artifacts (via their digital tools) not only provides an integrated view, but can also be capitalized on to support mediation among these artifacts and to keep them consistent with new conceptualizations. Such mediation among scientific artifacts will reconnect computationally enabled science with the knowledge underpinning it.
In the last decade, several Scientific Knowledge Graphs (SKGs) have been released, representing scientific knowledge in a structured, interlinked, and semantically rich manner. But what kind of information do they describe? How have they been built? What can we do with them? In this lecture, I will first provide an overview of well-known SKGs, like Microsoft Academic Graph, Dimensions, and others. Then, I will present the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 21M publications and 8M patents according to i) the research topics drawn from the Computer Science Ontology, ii) the type of the authors' affiliations (e.g., academia, industry), and iii) 66 industrial sectors (e.g., automotive, financial, energy, electronics) from the Industrial Sectors Ontology (INDUSO). Finally, I will showcase a number of tools and approaches using such SKGs, supporting researchers, companies, and policymakers in making sense of research dynamics.
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas - Angelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ... - Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, we need to move to the Web of Data, including a demo connecting various sources, all from your(!) machine.
Extreme weather events pose great potential risks to ecosystems, infrastructure, and human health. Analyzing extreme weather in the observed record (satellite, reanalysis products) and characterizing changes in extremes in simulations of future climate regimes are important tasks. Thus far, extreme weather events have typically been specified by the community through hand-coded, multi-variate threshold conditions. Such criteria are usually subjective, and often there is little agreement in the community on the specific algorithm that should be used. We propose a different approach: machine learning, and in particular deep learning. If human experts can provide spatio-temporal patches of a climate dataset, and associated labels, we can turn to a machine learning system to learn the underlying feature representation. The trained machine learning (ML) system can then be applied to novel datasets, thereby automating the pattern detection step. Summary statistics, such as the location, intensity, and frequency of such events, can easily be computed as a post-process.
We will report compelling results from our investigations of Deep Learning for the tasks of classifying tropical cyclones, atmospheric rivers and weather front events. For all of these events, we observe 90-99% classification accuracy. We will also report on progress in localizing such events: namely drawing a bounding box (of the correct size and scale) around the weather pattern of interest. Both tasks currently utilize multi-layer convolutional networks in conjunction with hyper-parameter optimization. We utilize HPC systems at NERSC to perform the optimization across multiple nodes, and utilize highly-tuned libraries to utilize multiple cores on a single node. We will conclude with thoughts on the frontier of Deep Learning and the role of humans (vis-a-vis AI) in the scientific discovery process.
Presentation held by Lim Ying Sean, Arun Anand Sadanandan, Dickson Lukose and Klaus Tochtermann at the Agricultural Ontology Service (AOS) Workshop 2012 in Kuching, Sarawak, Malaysia, September 3-4, 2012
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner - Francesco Osborne
The process of classifying scholarly outputs is crucial to ensure timely access to knowledge. However, this process is typically carried out manually by expert editors, leading to high costs and slow throughput. In this paper we present Smart Topic Miner (STM), a novel solution which uses semantic web technologies to classify scholarly publications on the basis of a very large automatically generated ontology of research areas. STM was developed to support the Springer Nature Computer Science editorial team in classifying proceedings in the LNCS family. It analyses in real time a set of publications provided by an editor and produces a structured set of topics and a number of Springer Nature classification tags, which best characterise the given input. In this paper we present the architecture of the system and report on an evaluation study conducted with a team of Springer Nature editors. The results of the evaluation, which showed that STM classifies publications with a high degree of accuracy, are very encouraging and as a result we are currently discussing the required next steps to ensure large-scale deployment within the company.
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin... - Advanced-Concepts-Team
Searching for information within large sets of unstructured, heterogeneous scientific data can be very challenging unless an inverted index has been created in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate the process of index construction. These solutions perform well when data are already distributed across the cluster nodes involved in the elaboration. On the other hand, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces to the bare minimum the number of I/O operations by using a stream of in-memory operations to extract and index heterogeneous data. We further improve the performance by using GPUs and POSIX Threads programming for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
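As a minimal, hypothetical sketch of the inverted-index structure such systems construct (not ISODAC's actual pipeline, which streams extraction and indexing in memory and offloads heavy stages to GPUs and POSIX threads):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Minimal in-memory inverted index: token -> sorted list of the
    ids of documents containing that token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, *terms):
    """Conjunctive query: ids of documents containing every term."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

The index maps each token to its posting list, so a conjunctive query reduces to intersecting a few sorted lists instead of scanning the raw documents.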
Presentation slide for this:
Kei Kurakawa, Toward universal information access on the digital object cloud, In book of abstracts of International Workshop on Data Science - Present & Future of Open Data & Open Science -, p.57-59, November 12-15, 2018, Mishima Citizens Cultural Hall & Joint Support-Center for Data Science Research, Mishima, Shizuoka, Japan
International Workshop on Sharing, Citation and Publication of Scientific Data across Disciplines
Joint Support-Center for Data Science Research (DS), ROIS
NIPR / NINJAL, Tachikawa, Tokyo, Japan, 5-7 December 2017.
Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012
OR2012, The 7th international conference on Open Repositories
09 - 13/Jul/2012, the University of Edinburgh, UK
RF3: Pecha Kucha – National Infrastructures, 11/Jul/2012: 11:00am – 12:30pm
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and can thus also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
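A minimal sketch of the first optimization mentioned above, skipping computation on already-converged vertices, assuming a graph with no dangling nodes (names and structure are my own illustration, not the STICD implementation):

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over an adjacency list {u: [v, ...]}.
    A vertex whose rank changes by less than tol is marked converged
    and skipped in later iterations, trading a little accuracy for
    less work per iteration. Assumes every vertex has out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    incoming = {u: [] for u in nodes}
    for u, outs in graph.items():
        for v in outs:
            incoming[v].append(u)
    outdeg = {u: len(graph[u]) for u in nodes}
    converged = set()
    for _ in range(max_iter):
        new = {}
        for v in nodes:
            if v in converged:
                new[v] = rank[v]  # skip work on converged vertices
                continue
            s = sum(rank[u] / outdeg[u] for u in incoming[v])
            new[v] = (1 - damping) / n + damping * s
            if abs(new[v] - rank[v]) < tol:
                converged.add(v)
        rank = new
        if len(converged) == n:
            break
    return rank
```

On a symmetric cycle every vertex converges in one step; on skewed graphs the hub vertices keep iterating while the periphery drops out early, which is where the per-iteration savings come from.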
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adjusting primitives for graph : SHORT REPORT / NOTES
Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence
1. Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence
Kei Kurakawa, Yuan Sun
National Institute of Informatics, Japan
Satoko Ando
Clarivate Analytics (Japan) Co., Ltd.
This is the presentation slides for the workshop BigScholar 2019 in conjunction with CIKM 2019 (ACM International Conference
on Information and Knowledge Management) Nov 7, 2019, at CNCC, Beijing, China.
Citation: Kurakawa K, Sun Y and Ando S (2020) Application of a Novel Subject Classification Scheme for a Bibliographic Database
Using a Data-Driven Correspondence. Front. Big Data 2:48. doi: 10.3389/fdata.2019.00048
2. Overview
• Introduction
• Motivation
• Applying a new subject classification scheme for a subject-classified bibliographic database
• Our main contributions
• Related work
• Theoretical background
• Subject classification model of the bibliographic database based on set theory
• Main steps of our data-driven approach
• Case study
• Applying the Japanese grants KAKENHI subject classification scheme for the Web of
Science citation database
• Conclusions and future work
3. Motivation
• In assessing research activities with bibliometrics, analysts are accustomed to using the major citation database Web of Science, whose subject classification schemes, i.e. the WoS Subject Categories, ESI, and GIPP, are prepared for qualitative analysis.
• Analysts also need domestic subject classification schemes for their analysis, but these are not implemented on the database.
• Applying a new classification scheme to the database by hand is too labor-intensive and time-consuming.
• How can we apply a new classification scheme to the database efficiently and effectively?
4. Our main contributions
• We propose an approach to apply a novel subject classification scheme to a subject-classified database using a data-driven correspondence between the new scheme and the present one, a practice familiar from digital libraries.
• We give a fundamental analytical model of subject classification schemes based on set theory and describe compact topological space formation for a new subject classification scheme as a necessary condition.
• We demonstrate the effectiveness and efficiency of our approach on a practical bibliographic database.
5. Related work
• In the field of computer science,
• Information retrieval
• Data mining
• Digital libraries
• Automated text categorization
• Classification (supervised learning)
• Naïve Bayes classification
• Neural networks
• Support vector machines
• Clustering (unsupervised learning)
• K-means
• Expectation maximization (EM)
• Hierarchical agglomerative clustering
• Divisive clustering
• Matrix decompositions
• More problem-specific methods
• Multi-label classification / multi-label
learning, based on
• SVM
• Deep learning
• Ensemble classification.
• Extreme multi-label classification, based on
• Graph embedding
• Convolutional neural network (CNN)
• Attention model of neural networks
• Label hierarchy considered
• A method of mapping between different
classification schemes
• Importing cataloguing records using a
different classification scheme in digital
libraries
• Information integration on the Web
6. Theoretical background
• Subject classification model of the bibliographic database
• Compact topological space formation for a new subject classification
scheme
• Inducing a correspondence between two subject classification
schemes using a research project database
12. Given a finite cover
• Given a finite cover 𝔒(1) = {𝑂𝑖} of the set of articles 𝑆, the pair (𝑆, 𝔒(1)) forms a compact topological space.
13. Another set of categories
• For another set of categories 𝔒(2) = {𝑂𝑖}, we want to form the compact topological space (𝑆, 𝔒(2)).
14. If we have an external database such as …
• Research project database
[Diagram: projects in an external database 𝑇 are linked to articles in 𝑆′ ⊆ 𝑆; the pair (𝑇, 𝔒(2) on 𝑇) forms a compact topological space.]
15. If we have an external database such as …
• Research project database
[Diagram: as on slide 14, the cover of (𝑇, 𝔒(2) on 𝑇) is carried over the project–article links to the linked articles, giving (𝑆′, 𝔒(2) on 𝑆′).]
• We can define a compact topological space for the second set of categories.
16. Compact topological spaces for the two subject classification schemes
• 𝔒(1) = {𝑂𝑖(1)} gives the compact topological space (𝑆′, 𝔒(1) on 𝑆′).
• 𝔒(2) = {𝑂𝑖(2)} gives the compact topological space (𝑆′, 𝔒(2) on 𝑆′).
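The setting of slides 12–16 can be written out compactly in LaTeX. This is a sketch in the slides' notation; the explicit restriction of the covers to the linked subset S′ is our reading of the diagrams:

```latex
% Sketch of the model on slides 12--16 (notation follows the slides).
% S: articles; O_i^{(1)}, O_i^{(2)}: categories of the two schemes.
\[
  \mathfrak{O}^{(1)} = \{O_i^{(1)}\}, \qquad
  S \subseteq \bigcup_i O_i^{(1)}
  \;\Longrightarrow\;
  (S, \mathfrak{O}^{(1)}) \text{ is a compact topological space.}
\]
% The second scheme is observed only on the subset S' of S whose
% articles are linked to the external project database T, so the two
% schemes are compared on S':
\[
  (S', \mathfrak{O}^{(1)}_{S'})
  \quad\text{and}\quad
  (S', \mathfrak{O}^{(2)}_{S'}).
\]
```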
21. Main steps of our approach
1. Inducing a correspondence between the two subject classification schemes by using the 𝐹𝛽-measure
1′-1. Constructing a contingency table between the two subject classification schemes
1′-2. Inducing a correspondence between the two subject classification schemes by using the pseudo 𝐹𝛽-measure
2. Revising the correspondence to guarantee the existence of a finite cover of the novel subject classification scheme
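Steps 1′-1/1′-2 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the contingency table is assumed to be a dict of co-occurrence counts, and the greedy selection with its stop-when-Fβ-stops-improving rule is a simplification.

```python
# Hypothetical sketch: induce a correspondence from a contingency table.
# table[kaken][wos] = number of articles carrying both labels.
# For each KAKENHI category, WoS categories are added greedily (largest
# overlap first) until the pseudo F-beta measure stops improving.

def pseudo_f(beta, tp, n_kaken, n_wos):
    """Pseudo precision = tp / n_wos, pseudo recall = tp / n_kaken."""
    if tp == 0:
        return 0.0
    p = tp / n_wos
    r = tp / n_kaken
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def induce_correspondence(table, beta=1.0):
    """Return a mapping {kakenhi_category: [wos_category, ...]}."""
    # Column totals: articles per WoS category over all KAKENHI rows.
    col = {}
    for row in table.values():
        for wos, c in row.items():
            col[wos] = col.get(wos, 0) + c
    mapping = {}
    for kaken, row in table.items():
        n_kaken = sum(row.values())  # articles in this KAKENHI category
        ranked = sorted(row, key=row.get, reverse=True)
        chosen, tp, n_wos, best = [], 0, 0, 0.0
        for wos in ranked:
            f = pseudo_f(beta, tp + row[wos], n_kaken, n_wos + col[wos])
            if f <= best:
                break  # adding more WoS categories no longer helps
            chosen.append(wos)
            tp += row[wos]
            n_wos += col[wos]
            best = f
        mapping[kaken] = chosen
    return mapping
```

With β = 1 this reduces to the pseudo F1-measure reported on slide 31; β can be tuned to weight pseudo recall over pseudo precision.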
22. Case study
• InCites™ (Clarivate Analytics)
• A world class research evaluation platform
• Web of Science™ citation database
• Web of Science classification scheme (251 categories)
• Essential Science Indicators (ESI) classification scheme (22 categories)
• Japanese users are eager to utilize the subject
classification scheme of Japan’s largest national
research grants KAKENHI.
• KAKEN (NII) research project database
• Archival records of research projects and the
outputs of KAKENHI grants in Japan.
• KAKENHI subject classification scheme (hierarchical
classification scheme; 4 categories, 10 areas, 67
disciplines, and 284 research fields)
25. Developing a contingency table as evidence
data
• We identified the same bibliographic records in the WoS citation database as of 2009 and
2010 through a set of record linkage techniques to obtain a set of articles 𝑆′ that are
classified using both the KAKENHI and WoS classification schemes.
[Diagram: bibliographic linkage between KAKEN (projects linked to articles 𝑎1, 𝑎2, 𝑎3) and Web of Science (articles 𝑎1′, 𝑎2′, 𝑎3′), identifying 𝑎𝑖′ ≡ 𝑎𝑖 so that the articles in 𝑆′ carry both classification schemes.]
26. A contingency table between WoS and
KAKENHI subject classification schemes
[Table: a part of the 251 WoS categories × 67 KAKENHI areas]
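Step 1′-1, counting the co-occurrences of KAKENHI and WoS labels over the dually classified articles in 𝑆′, can be sketched as follows; the record format and function name are hypothetical:

```python
# Illustrative sketch: build a contingency table from articles that carry
# both a KAKENHI label (via linked projects) and a WoS subject category.
from collections import defaultdict

def build_contingency(articles):
    """articles: iterable of (kakenhi_labels, wos_labels) pairs, one per
    article in S'. Each co-occurring label pair adds one count."""
    table = defaultdict(lambda: defaultdict(int))
    for kaken_labels, wos_labels in articles:
        for k in kaken_labels:
            for w in wos_labels:
                table[k][w] += 1
    # Convert to plain dicts for easier downstream use.
    return {k: dict(v) for k, v in table.items()}
```

Multi-labeled articles contribute one count per label pair, which is why the sum of the table's frequency counts (97,175, slide 38) can exceed the number of linked articles.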
27. Analysis of the contingency table
The discrete generalized beta distribution (DGBD):
𝑓(𝑟) = 𝐴 (𝑁 + 1 − 𝑟)ᵇ / 𝑟ᵃ
where 𝑟 is the rank value, 𝑁 its maximum value, 𝐴 a normalization constant, and 𝑎 and 𝑏 two fitting components.
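The DGBD is easy to evaluate directly; the sketch below uses the standard functional form, with parameter values chosen purely for illustration (they are not fits to the paper's data):

```python
# Minimal sketch of the discrete generalized beta distribution (DGBD):
#   f(r) = A * (N + 1 - r)**b / r**a
# r: rank (1 <= r <= N), N: maximum rank, A: normalization constant,
# a, b: fitting exponents.

def dgbd(r, N, A, a, b):
    """Value of the DGBD at rank r."""
    return A * (N + 1 - r) ** b / r ** a

# With positive exponents the curve decays with rank, which is the shape
# typically fitted to rank-ordered contingency counts.
values = [dgbd(r, N=10, A=1.0, a=0.8, b=0.3) for r in range(1, 11)]
```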
31. The third-level 67 disciplines

Averages: cardinality 1450.4; no. of WoS subject categories 6.1; pseudo precision 0.315; pseudo recall 0.367; pseudo F1-measure 0.317

| Seq. no. | KAKENHI subject category | Translation | Cardinality | No. of WoS subject categories to get the max pseudo F1-measure | Pseudo precision | Pseudo recall | Max pseudo F1-measure |
|---|---|---|---|---|---|---|---|
| (l3-46) | 材料工学 | Material Engineering | 2931 | 6 | 0.348 | 0.523 | 0.418 |
| (l3-47) | プロセス工学 | Process/Chemical Engineering | 1283 | 4 | 0.145 | 0.306 | 0.197 |
| (l3-48) | 総合工学 | Integrated Engineering | 1465 | 8 | 0.256 | 0.309 | 0.280 |
| (l3-49) | 基礎生物学 | Basic Biology | 2423 | 7 | 0.375 | 0.400 | 0.387 |
| (l3-50) | 生物科学 | Biological Science | 2679 | 4 | 0.167 | 0.582 | 0.259 |
| (l3-51) | 人類学 | Anthropology | 300 | 3 | 0.315 | 0.440 | 0.367 |
| (l3-52) | 農学 | Plant Production and Environmental Agriculture | 899 | 4 | 0.307 | 0.449 | 0.365 |
| (l3-53) | 農芸化学 | Agricultural Chemistry | 1755 | 6 | 0.220 | 0.386 | 0.281 |
| (l3-54) | 林学 | Forest and Forest Products Science | 559 | 5 | 0.408 | 0.252 | 0.312 |
| (l3-55) | 水産学 | Applied Aquatic Science | 581 | 2 | 0.419 | 0.327 | 0.367 |
| (l3-56) | 農業経済学 | Agricultural Science in Society and Economy | 31 | 2 | 0.333 | 0.097 | 0.150 |
| (l3-57) | 農業工学 | Agro-Engineering | 216 | 4 | 0.157 | 0.259 | 0.195 |
| (l3-58) | 畜産学・獣医学 | Animal Life Science | 1190 | 4 | 0.511 | 0.387 | 0.440 |
| (l3-59) | 境界農学 | Boundary Agriculture | 541 | 4 | 0.235 | 0.148 | 0.181 |
| (l3-60) | 薬学 | Pharmacy | 3457 | 4 | 0.294 | 0.369 | 0.328 |
| (l3-61) | 基礎医学 | Basic Medicine | 5232 | 16 | 0.213 | 0.551 | 0.307 |
| (l3-62) | 境界医学 | Boundary Medicine | 850 | 12 | 0.162 | 0.112 | 0.132 |
| (l3-63) | 社会医学 | Society Medicine | 1065 | 8 | 0.282 | 0.262 | 0.271 |
32. Miscellaneous considerations
• Decision by an expert
• Limit the number of correspondences to 1–4 for each 𝑂𝑖(1).
• For every Web of Science subject category 𝑂𝑖(1), the number of relations with KAKENHI subject categories 𝑂𝑗(2) is limited to 4 at most.
• For every Web of Science subject category 𝑂𝑖(1), when the recall rate exceeds one half, we stop adding any more relations.
• Check all correspondences between 𝑂𝑖(1) and 𝑂𝑗(2).
• Add or remove correspondence relations between them by means of subject classification keywords.
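The quantitative part of these expert-revision rules can be expressed as a small filter. The thresholds (at most 4 relations, stop once recall exceeds one half) follow the slide, while the data format and names are hypothetical; the keyword-based manual checks are of course not automated here:

```python
# Hedged sketch of the revision rules for one WoS subject category.
# candidates: list of (kakenhi_category, recall_gain) pairs, assumed
# sorted by decreasing relevance; recall_gain is the recall contributed
# by adding that relation.

def revise(candidates, max_relations=4, recall_cap=0.5):
    """Keep at most max_relations candidates, stopping early once the
    accumulated recall exceeds recall_cap."""
    kept, recall = [], 0.0
    for kaken, gain in candidates:
        if len(kept) >= max_relations or recall > recall_cap:
            break
        kept.append(kaken)
        recall += gain
    return kept
```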
34. Example screen of InCites™ (a snapshot of 2018-12-14)
• WoS Documents: 58,395,008 for Web of Science subject categories
• WoS Documents: 3,192,449 for Web of Science subject categories limited with “LOCATION = JAPAN”
• WoS Documents: 3,191,448 for KAKEN L3 subject categories limited with “LOCATION = JAPAN”
The bubbles represent proportional numbers of articles classified using the KAKENHI subject categories.
35. Top 30 subject distribution of Japanese authors’ articles with the two subject classification schemes
[Two charts: WoS subject classification scheme; KAKENHI subject classification scheme]
36. User feedback: Questions and answers on the validity of the KAKENHI subject classification scheme
• KAKEN classification scheme
• April 2016, released on InCites Benchmarking
• User survey
• March 2017, by online questionnaire to institutional active users
• 18 questions
• Results
• Feedback from 26 institutional users
• Q7: Which levels of hierarchy in the KAKENHI subject classification scheme do you need?
• Q11: Do you feel comfortable with your analysis results by the KAKENHI subject classification scheme, in accordance with your experience?

| User role in the institution | Yes (multiple answers possible) |
|---|---|
| RA (research administrator) | 20 |
| Administrator / officer | 3 |
| IR (institutional research) staff | 5 |
| Others | 2 |

Other: 1, “I need more detailed categories”
37. Discussion (1)
• Our approach, i.e. deciding a correspondence between two subject classification schemes, has an inherent limitation.
• In natural correlations between subject categories of two classification schemes, each subject category of one scheme partly overlaps several subject categories of the other scheme.
• There is no inclusion relationship between them.
• Correspondence relations are probabilistic.
• Research projects and journal articles have similarities and differences in subject.
• Projects and articles have a strong correlation in subject.
• But they also differ in subject.
• Projects precede articles.
• Projects tend to indicate the central concept with essential keywords.
38. Discussion (2)
• Nevertheless, the classification results were accepted by InCites users.
• Our approach requires less workload.
• The number of journal titles in the Web of Science citation database is 24,688.
• The number of Web of Science documents in InCites is 58,395,008.
• The number of subject category pairs for which a correspondence must be decided is far smaller:
• For KAKEN 67 – WoS 251, the number of pairs is 16,817.
• For KAKEN 10 – WoS 251, the number of pairs is 2,510.
• But the evidence data is not sufficient for automatic decision making.
• The sum of the frequency counts of the contingency table is 97,175.
• Manual handling was needed.
39. Conclusions and future work
• Conclusions
• We proposed an approach to apply a new subject classification scheme to a bibliographic database that is already classified with another subject classification scheme.
• We gave a fundamental analytical model of subject classification schemes based on set theory.
• Compact topological space formation for a new subject classification scheme is a necessary condition.
• An external database, e.g. a research project database, is utilized to induce a correspondence between the two subject classification schemes.
• We applied the approach to a practical example, InCites™, a research evaluation tool based on the Web of Science citation database, to add the subject classification scheme of Japan’s largest national grants, KAKENHI. The user survey indicates that users generally accept the new function.
• Future work
• For a complex classification scheme, such as a hierarchical one, our approach should be extended to account for its structure.
• Alternatively, multi-label learning is another possible way to reach our goal. We need to compare it with our method.
40. Acknowledgments
• This presentation is a result of a joint research between National
Institute of Informatics and Clarivate Analytics, Co., Ltd. As for the
databases we used in this presentation, the KAKEN database is
provided by National Institute of Informatics, Cyber Science
Infrastructure Development Department, Scholarly and Academic
Information Division, and the Web of Science citation database is
provided by Clarivate Analytics, Co., Ltd. We are thankful to the
organizations who let us use the valuable assets.
Editor's Notes
intersection, set difference, union,
Another set of categories is unknown for the set S.
We want to specify the compact topological space for S.
As a data-driven approach, …
Given a research project database, we can observe compact topological spaces for the two subject classification schemes.
Strategic position to induce a correspondence is to maximize the F-measure.
The pseudo metrics are different from the original metrics because of subadditivity.