The document summarizes a research paper that proposes a link prediction model for citation networks. The model uses support vector machines (SVMs) as the classifier with 11 features optimized for citation networks, and is evaluated on citation networks from 5 research domains. It predicts citations with F1 values between 0.74 and 0.82. The features that contribute to prediction vary by domain, suggesting that different models should be applied to different research areas.
Paper survey (Sasaki)
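1. Paper survey @ research meeting, 2012.04.19:
N. Shibata, Y. Kajikawa, I. Sakata, "Link Prediction in Citation Networks," Journal of the American Society for Information Science and Technology, 63(1): 78-85, 2012
Hajime SASAKI (佐々木一)
Project Researcher, Policy Alternatives Research Institute
Affiliated Researcher, Innovation Policy Research Center, Institute of Engineering Innovation, School of Engineering
Cooperating Researcher, Ichiro Sakata Laboratory, Department of Technology Management for Innovation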
3. Introduction
• As the number of academic papers increases exponentially (Price, 1965), each academic area becomes specialized and segmented.
• The individual scientist has to focus on or specialize in only a few scientific subdomains to keep up with the growth of the domains, which means that researchers must focus on increasingly narrow domains.
Research Question: What factors affect the existence of links, using features intrinsic to the network itself? Answering this question, namely link prediction, will help scholars to know which paper to cite and managers to identify future core papers.
• In this article, the authors utilize textual, topological, and attribute features for link prediction, which are considered to influence citing behaviors.
4. Previous research
• Liben-Nowell and Kleinberg (2003): proposed a model for link prediction in large coauthorship networks.
• Clauset, Moore, and Newman (2008): investigated the hierarchical structure of social networks to predict missing connections in partially known networks with high accuracy.
• Popescul and Ungar (2003): proposed a new approach for Statistical Relational Learning to build link prediction models.
• Hasan, Chaoji, Salem, and Zaki (2006): tested several supervised learning models (decision tree, k-nearest neighbor, multilayer perceptron, support vector machine [SVM], radial basis function [RBF] network) for link prediction.
• Murata and Moriyasu (2008): applied the model of Liben-Nowell and Kleinberg to social networks of Question-Answering Bulletin Boards.
• Caragea, Bahirwani, Aljandal, and Hsu (2009): proposed an algorithm to predict potential friendships based on a clustering approach in LiveJournal, a social network journal service with a focus on user interactions.
• Lu, Jin, and Zhou (2009): presented a local path index to estimate the likelihood of the existence of a link between two nodes.
• Seglen (1994): analysed the trends of papers in journals with large impact factors.
• Vinkler and Davidson (2002): indicated that papers in journals that are growing in terms of the number of papers are more likely to be cited.
• Hwang, Wylie, Wei, and Liao (2010): proposed recommendation engines based on coauthorship networks.
5. Features of this study
• 1: The focus is on citation networks.
• 2: The authors apply SVMs as the supervised learning method (the classifier), as SVM is the best learner according to Hasan et al. (2006).
• 3: The authors use more comprehensive features optimized for citation networks.
6. Significance of this study
• Helps us decide more accurately whether to link, even with a huge amount of data.
• Application: building a citation recommendation system for authors of scientific publications and patents.
– First, reviewers of scientific papers can reduce the time needed to check whether the references in those papers are adequate.
– Second, well-organized link prediction can reveal how and why authors cite other scientific papers.
– Finally, link prediction can bond different research fields with similar topics but from different disciplines.
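10. Overfitting
(Figure: three panels labeled A, B, and C.)
• When a model overfits, adding more samples (parameters) does not bring it closer to the true solution.
• This is handled by imposing constraints such as smoothness (regularization).
• The prediction model should be kept simple.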
11. … and w = (w_1, w_2, ..., w_d) is the parameter vector of the same dimension that specifies the model. A positive value of w_j indicates that the j-th feature x_j positively contributes to the prediction, while a negative value contributes to it negatively. The sign function returns +1 when its argument is positive, and returns −1 otherwise. Given the data set X and Y, the SVM learning algorithm finds the optimal parameter w* that minimizes the following objective function:
\[ \sum_{i} \max\{1 - y_i h(x_i),\, 0\} + c\,\|w\|_2^2 \]
(Excerpt from p. 79 of the paper, January 2012 issue; DOI: 10.1002/asi)
• We want to minimize the sum of two terms: a term that keeps errors few while classifying with as much confidence as possible (the loss), and a term that favors as simple a model as possible (the regularization term).
• Loss function (first term): a penalty for misclassification.
• Regularization term (second term): prevents overfitting. A model that adapts excessively to the training data performs worse on unseen data (generalization performance).
• The goal is to determine the parameters (weights) that minimize the whole expression.
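The objective on this slide can be evaluated numerically with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a linear score h(x) = w · x (the slide uses h without defining it), and the toy data, the weight vector, and the value of c are made up for illustration.

```python
from typing import Sequence


def h(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear score w . x (assumed form of h; the slide does not define it)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """sign(h(x)): +1 when the score is positive, -1 otherwise."""
    return 1 if h(w, x) > 0 else -1


def objective(w, X, Y, c: float) -> float:
    """Sum of hinge losses over the data plus c times the squared L2 norm of w."""
    loss = sum(max(1.0 - y * h(w, x), 0.0) for x, y in zip(X, Y))
    regularizer = c * sum(wj * wj for wj in w)
    return loss + regularizer


# Hypothetical toy data: two feature vectors with labels +1 and -1.
X = [(1.0, 0.0), (0.0, 1.0)]
Y = [1, -1]

# Zero weights: every example pays a loss of 1, and the regularizer is 0.
print(objective((0.0, 0.0), X, Y, c=0.1))
# Larger weights classify both examples with margin, so only the regularizer remains.
print(objective((2.0, -2.0), X, Y, c=0.1))
```

The two calls show the trade-off described on the slide: increasing the weights removes the loss term but is charged by the regularization term, and c sets the exchange rate between the two.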
12. Features (11 in total)
Topological Features
• (1) The number of common neighbours. (Number of shared nodes.)
• (2) Link-based Jaccard coefficient. (Proportion of shared nodes.)
• (3) Difference in betweenness centrality. (Nodes with high betweenness centrality are cited.)
• (4) Difference in the number of in-links. (Nodes with many links are cited.)
• (5) Is same cluster. (Whether both papers are in the same cluster.)
Semantic Features
• (6) Cosine similarity of term frequency–inverse document frequency (tf–idf) vectors. (Whether the papers share the same semantic characteristics.)
Attribute Features
• (7) Difference in publication year. (Recent papers are cited often.)
• (8) The number of common authors. (Number of shared authors.)
• (9) Is self citation. (Same author.)
• (10) Is published in same journal. (Whether both papers appear in the same journal.)
• (11) Number of times "to" cited. (The rich get richer.)
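Three of the features listed above, (1), (2), and (6), can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's code: a paper's neighbours are taken to be every paper it cites or is cited by, the Jaccard coefficient is the shared neighbours divided by the union of neighbours, and term vectors are plain dictionaries whose weights would come from a tf–idf step that is not shown. All paper IDs and terms are hypothetical.

```python
import math
from collections import defaultdict


def neighbours(citations):
    """Map each paper to the set of papers it cites or is cited by (assumed definition)."""
    nbrs = defaultdict(set)
    for citing, cited in citations:
        nbrs[citing].add(cited)
        nbrs[cited].add(citing)
    return nbrs


def common_neighbours(nbrs, a, b):
    """Feature (1): number of neighbours shared by papers a and b."""
    return len(nbrs[a] & nbrs[b])


def jaccard(nbrs, a, b):
    """Feature (2): shared neighbours divided by all neighbours of a or b."""
    union = nbrs[a] | nbrs[b]
    return len(nbrs[a] & nbrs[b]) / len(union) if union else 0.0


def cosine(u, v):
    """Feature (6): cosine similarity of two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical citation pairs (citing paper, cited paper).
citations = [("p1", "p3"), ("p2", "p3"), ("p1", "p4"), ("p2", "p5")]
nbrs = neighbours(citations)

print(common_neighbours(nbrs, "p1", "p2"))  # p3 is the only shared neighbour
print(jaccard(nbrs, "p1", "p2"))            # 1 shared out of 3 distinct neighbours
print(cosine({"solar": 1.0, "cell": 1.0}, {"solar": 1.0}))
```

The remaining features (betweenness centrality, clustering, and the attribute features) need a graph library or bibliographic metadata and are left out of this sketch.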
13. Dataset
TABLE 2. Datasets of citation networks.
Dataset | Query | Published through | No. of papers | No. of citations
A Innovation | innovation* | 2009 | 20,564 | 106,619
B Nano Bio | nano* and bio* | 2009 | 33,830 | 175,875
C Organic LED | ((organic* or polymer*) and (electroluminescen* or electro-luminescen* or electro luminescen* or light emitting or LED*)) or OLED* | 2009 | 19,486 | 196,123
D Solar Cells | solar cell* | 2008 | 18,587 | 111,051
E Secondary Batteries (*) | ((secondary or storage or rechargeable or reserve) and cell*) or batter* | 2008 | 20,430 | 145,008
Data and Experiment
In this article, five large-scale citation datasets, Innovation, Nano Bio, Organic LED, Solar Cells, and Secondary Batteries, are collected as shown in Table 2. We searched databases of academic papers and patents using the same query for each domain. The databases of academic papers used are the Science Citation Index Expanded (SCI-EXPANDED), the Social Sciences Citation Index (SSCI), and the Arts & Humanities Citation Index (A&HCI) compiled by the Institute for Scientific Information (ISI). After collecting data, we extracted the papers and citations in the largest-graph component to …
14. Cross Validation
• 1. These existing citations are divided into five groups (positive instances, namely, P[1] to P[5]).
• 2. We randomly created the same number of pairs where citations did not exist (negative instances, namely, N[1] to N[5]).
• 3. In the first experiment, P[2] to P[5] and N[2] to N[5] were used as the training data and P[1] and N[1] were used as the test data.
• 4. We repeated step 3 five times in total with a different choice of answer set.
(Figure: for each of rounds 1 to 5, the groups P1 to P5 of pairs with a citation and N1 to N5 of pairs without a citation, each marked as either test data or training data.)
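The four cross-validation steps can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the helper names (`make_folds`, `sample_negatives`, `cross_validation_rounds`), the random seeds, and the ten-paper toy network are all assumptions, and negative pairs are drawn by simple rejection sampling.

```python
import random


def make_folds(items, k, rng):
    """Shuffle items and deal them into k groups of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def sample_negatives(papers, positives, n, rng):
    """Draw n distinct ordered pairs (a, b), a != b, that have no citation.

    Assumes far more non-citing pairs than n exist; otherwise this never ends.
    """
    existing = set(positives)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(papers, 2)
        if (a, b) not in existing:
            negatives.add((a, b))
    return sorted(negatives)


def cross_validation_rounds(positives, negatives, k=5, seed=0):
    """Return k (train, test) splits; round i tests on P[i] and N[i]."""
    rng = random.Random(seed)
    P = make_folds(positives, k, rng)
    N = make_folds(negatives, k, rng)
    rounds = []
    for i in range(k):
        test = [(pair, +1) for pair in P[i]] + [(pair, -1) for pair in N[i]]
        train = [(pair, +1) for j in range(k) if j != i for pair in P[j]]
        train += [(pair, -1) for j in range(k) if j != i for pair in N[j]]
        rounds.append((train, test))
    return rounds


# Hypothetical toy network: ten papers, each citing the next one in a ring.
papers = [f"p{i}" for i in range(10)]
positives = [(papers[i], papers[(i + 1) % 10]) for i in range(10)]
negatives = sample_negatives(papers, positives, len(positives), random.Random(1))

rounds = cross_validation_rounds(positives, negatives)
for number, (train, test) in enumerate(rounds, start=1):
    print(f"round {number}: {len(train)} training pairs, {len(test)} test pairs")
```

Each labeled pair would then be turned into a feature vector and passed to the classifier; every positive and negative pair is used as test data in exactly one round.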
16. Result
TABLE 3. Prediction results.
Dataset | Precision | Recall | F1
A Innovation | 0.75 | 0.91 | 0.82
B Nano Bio | 0.83 | 0.76 | 0.79
C Organic LED | 0.79 | 0.71 | 0.74
D Solar Cells | 0.76 | 0.72 | 0.74
E Secondary Batteries | 0.80 | 0.77 | 0.77
F-value: 0.74~0.82
Based on the results, we obtained the learning model on our training data.
As a learner, we employed L2-regularized and L2-loss …
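For readers less familiar with the three metrics in Table 3, the sketch below shows the standard definitions of precision, recall, and F1. The function names and the confusion counts are hypothetical and are not taken from the paper; only the precision and recall of row A (0.75 and 0.91) come from the table.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts: 8 citations found, 2 false alarms, 2 citations missed.
print(precision_recall_f1(tp=8, fp=2, fn=2))

# Row A of Table 3: precision 0.75 and recall 0.91.
print(round(f1_score(0.75, 0.91), 2))
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, so a high recall cannot fully compensate for a low precision.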
17. Weights of features
Positive contribution: >= 0.5
Negative contribution: <= -0.5
No contribution: between -0.5 and 0.5
TABLE 4. Weights of features.
Features | A. Innovation | B. Nano Bio | C. Organic LED | D. Solar Cells | E. Secondary Batteries
1. No. common neighbors | 0.566 | 0.889 | 0.520 | 0.683 | 0.987
2. Link-based Jaccard coefficient | 1.354 | 2.198 | −6.150 | −0.703 | −4.742
3. Difference in betweenness centrality | −1.446 | −6.107 | −2.175 | −5.468 | −10.049
4. Difference in the number of in-links | 0.052 | 0.033 | 0.034 | 0.045 | 0.047
5. Is same cluster | 0.018 | 0.086 | −0.308 | −0.160 | −0.062
6. Cosine similarity of tf-idf vectors | −19.897 | −17.817 | −15.527 | 1.624 | 1.519
7. Difference in publication year | 0.018 | 0.046 | 0.032 | 0.009 | 0.008
8. The number of common authors | −0.112 | 0.476 | 0.403 | 0.152 | 0.036
9. Is self-citation | 1.975 | 0.756 | 0.605 | 0.865 | 0.918
10. Is published in same journal | 0.726 | 0.614 | 0.198 | 0.027 | −0.108
11. Number of times “to” cited | −0.018 | −0.019 | −0.015 | −0.031 | −0.033
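The legend's thresholds can be written as a small helper that labels each weight. This is a minimal sketch: the function name `contribution` and the dictionary layout are illustrative choices, while the weights are the link-based Jaccard coefficient row of Table 4.

```python
def contribution(weight: float, threshold: float = 0.5) -> str:
    """Label a feature weight using the slide's legend (threshold 0.5)."""
    if weight >= threshold:
        return "positive"
    if weight <= -threshold:
        return "negative"
    return "none"


# Row 2 of Table 4: link-based Jaccard coefficient.
jaccard_weights = {
    "A. Innovation": 1.354,
    "B. Nano Bio": 2.198,
    "C. Organic LED": -6.150,
    "D. Solar Cells": -0.703,
    "E. Secondary Batteries": -4.742,
}

labels = {dataset: contribution(weight) for dataset, weight in jaccard_weights.items()}
for dataset, label in labels.items():
    print(f"{dataset}: {label}")
```

Applied to this row, the helper separates datasets A and B (positive) from C, D, and E (negative), which is the split the slide's commentary on feature (2) discusses.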
・Especially, (2), (3) and (6) largely affected the predictions of citations.
・(2): (A) and (B) comprise multiple research fields, and most citations stay within each research field, so papers often cite locally. (C), (D) and (E) are each contained in a research field with a single issue.
・(3): core nodes and peripheral nodes, which have different values of betweenness centrality, are linked in the citation networks.
・(6): same as (3).
・(1): the more common neighbours two papers have, the more related they are.
・(9): because authors tend to cite their own papers.
・(10): same as (3), (6).
Excerpt from the paper shown behind the slide text:
… the existence of a citation with a probability from 74% to 82%, given a pair of papers and the entire citation network. Especially three features, (2) link-based Jaccard coefficient, (3) difference in betweenness centrality, and (6) cosine similarity of tf–idf vectors, largely affected the predictions of citations. Link-based Jaccard coefficient contributed positively in the cases of (A) Innovations (weight: 1.354) and (B) Nano Bio (2.198) but negatively in the cases of (C) Organic LED (−6.150), (D) Solar Cells (−0.703) and (E) Secondary Batteries (−4.742). These results indicate that the former research areas, such as (A) Innovations and (B) Nano Bio, …
… of common neighbours positively affected all cases, because the more common neighbours two papers have, the more related they are. That the self-citation result had a positive effect is reasonable because authors tend to cite their own papers. The feature of is published in the same journal only affected (A) Innovations (0.726) and (B) Nano Bio (0.614) positively. Similar to the result of link-based Jaccard coefficient, papers tend to cite in each research field in the case of research fields with multiple issues. In summary, different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues. In the case …
18. Summary
• It is difficult to build a universal learner for link prediction, and we need to build learners based on the characteristics of each research domain.
• Different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues.
– The first one is the research field with multiple issues, such as (A) Innovations and (B) Nano Bio.
– The second one is a simple research field type with commonly understood targets of research and development, such as (C) Organic LED, (D) Solar Cells and (E) Secondary Batteries.