Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012
A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web
1. A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web
Kei Kurakawa [1], Yuan Sun [1], Nagayoshi Yamashita [2], Yasumasa Baba [3]
[1] National Institute of Informatics
[2] GMO Research (formerly Japan Society for the Promotion of Science)
[3] The Institute of Statistical Mathematics
2. U-I-G relations
• To make policy for science and technology research and development, university-industry-government (U-I-G) relations are an important aspect to investigate (Leydesdorff and Meyer, 2003).
(Figure: U-I-G triangle diagram)
• Web documents are one of the research targets for clarifying the state of the relationship.
• In the clarification process, obtaining the exact resources of U-I-G relations is the first requirement.
3. Objective
• The objective is to automatically extract resources of U-I relations from the web.
(Figure: U-I-G triangle diagram)
• We set the target to “press release articles” of organizations, and build a framework to automatically crawl them and decide which are about U-I relations.
4. Automatic extraction framework for U-I relations documents on the web
(Figure: processing pipeline)
1. Crawling: press release articles published on university or company web sites → crawled web documents
2. Extracting text from the documents → extracted texts
3. Learning to classify the documents → learned model file
4. Classifying the documents with the learned model
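The four-step framework above can be sketched end to end. This is a minimal stand-in, not the authors' implementation: the page content, URL, and all function names are illustrative, the crawl is stubbed with a canned page, and a majority-class stub takes the place of the SVM trained in step 3.

```python
import re

# Stand-in for step 1's web crawl; URL and page content are invented.
PAGES = {
    "https://univ.example/pr1": "<html><body>Joint study with ACME Inc. started.</body></html>",
}

def crawl(urls):
    """1. Crawling: fetch press-release pages (stubbed)."""
    return [PAGES[u] for u in urls if u in PAGES]

def extract_text(html):
    """2. Extracting text: strip markup and tokenize (very rough)."""
    return re.sub(r"<[^>]+>", " ", html).split()

def train(samples, labels):
    """3. Learning: a majority-class stub standing in for the SVM."""
    return max(set(labels), key=labels.count)

def classify(model, tokens):
    """4. Classifying: apply the learned model to a document."""
    return model

texts = [extract_text(h) for h in crawl(list(PAGES))]
model = train(texts, [1])         # 1 = U-I relations document
print(classify(model, texts[0]))  # → 1
```

In the real framework, step 3 produces a model file from labeled training documents, and step 4 loads it to label newly crawled documents.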
5. Support Vector Machine (1) (Vapnik, 1995)
• Two-class classifier: y(x) = w^T φ(x) + b, where b is a bias parameter and φ is a fixed feature space transformation.
• N input vectors
  – Input vectors: x_1, ..., x_N
  – Target values: t_1, ..., t_N, where t_n ∈ {-1, 1}
• For all input vectors, t_n y(x_n) > 0.
• Maximize the margin between the hyperplanes y(x) = 1 and y(x) = -1.
(Figure: decision boundary y = 0, margin hyperplanes y = ±1, and support vectors)
6. Support Vector Machine (2)
• Optimization problem:
  arg min_{w,b} (1/2) ||w||^2
  subject to the constraints t_n (w^T φ(x_n) + b) ≥ 1, n = 1, ..., N.
• By means of the Lagrangian method,
  y(x) = Σ_{n=1}^{N} a_n t_n k(x, x_n) + b,
  where the kernel function is defined by k(x, x′) = φ(x)^T φ(x′), and the a_n ≥ 0 are Lagrange multipliers (support vectors are the points with a_n > 0).
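The dual form y(x) = Σ a_n t_n k(x, x_n) + b can be evaluated directly once the support vectors, their multipliers, and the kernel are known. A toy 1-D sketch with a linear kernel and hand-picked values (not from the experiment): two support vectors at x = ±1 with equal multipliers and b = 0 by symmetry, chosen so that y(x) = ±1 exactly on the support vectors.

```python
def linear_kernel(x, z):
    return x * z  # k(x, x') = phi(x)^T phi(x') with phi = identity

# (x_n, t_n, a_n): illustrative support vectors and multipliers.
support = [(1.0, +1, 0.5), (-1.0, -1, 0.5)]
b = 0.0

def y(x):
    """Dual-form decision function y(x) = sum_n a_n t_n k(x, x_n) + b."""
    return sum(a * t * linear_kernel(x, xn) for xn, t, a in support) + b

print(y(1.0))   # → 1.0, on the margin hyperplane y(x) = 1
print(y(-1.0))  # → -1.0, on the margin hyperplane y(x) = -1
print(y(2.0))   # positive → classified as class +1
```

The sign of y(x) gives the predicted class; only support vectors (a_n > 0) contribute to the sum.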
7. U-I relations documents on the web
• Texts extracted from web documents are very noisy for content analysis.
  – Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
• In our observation,
  – irrelevant text tends to be an isolated term rather than a sentence;
  – for detecting U-I relations, the exact evidence of relevance occurs in two or three sequential, formal sentences.
  • For example, ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”,
  • or, for a Japanese target, ”東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....” (“The University of Tokyo and OMRON Corporation, through joint research, [developed a technique] robust to overlap and occlusion....”)
• It is therefore enough to filter for text that includes punctuation marks, which indicate a fully formal sentence.
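The punctuation filter described above can be sketched in a few lines. The exact punctuation set is an assumption (here Japanese 「。」「、」 plus ASCII period and comma); the sample lines are invented menu labels and one formal sentence.

```python
# Keep only formal sentence text: lines containing sentence punctuation.
# Isolated menu labels and header/footer fragments lack punctuation.
PUNCT = ("。", "、", ".", ",")  # assumed punctuation set

def is_formal(line):
    return any(p in line for p in PUNCT)

extracted = [
    "ホーム",                                      # menu label ("Home") → dropped
    "お問い合わせ",                                # menu label ("Contact") → dropped
    "東京大学とオムロン株式会社は、共同研究を開始した。",  # formal sentence → kept
]
kept = [line for line in extracted if is_formal(line)]
print(kept)  # only the formal sentence survives
```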
8. Feature selection
• tf-idf (Term Frequency – Inverse Document Frequency)
• tf-idf is defined by tf-idf(t, d, D) = tf(t, d) × idf(t, D), where t is a term, d is a document, and D is the set of all documents.
• The feature is defined by
  x_{t,d} = tf-idf(t, d, D) × b_{t,d}
  x_d = (x_{t_1,d}, x_{t_2,d}, ..., x_{t_M,d})
  b_{t,d} = 1 if t ∈ d, 0 if t ∉ d
• In our experiment, the term can be a term in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
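The tf-idf definition above can be computed directly. The slides do not fix an idf variant, so idf(t, D) = log(|D| / df(t)) is assumed here; the three tokenized toy documents are illustrative, built from the K(14) keywords.

```python
import math

def tf(term, doc):
    """Term frequency: raw count of term in a tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency, assumed as log(|D| / df(t))."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    """tf-idf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, docs)

# Toy corpus D of tokenized documents (illustrative).
D = [["研究", "開発", "開始"], ["研究", "受賞"], ["共同", "研究", "開発"]]

# "研究" occurs in every document → idf = log(3/3) = 0, so tf-idf = 0.
print(tfidf("研究", D[0], D))  # → 0.0
# "開発" occurs in 2 of 3 documents → tf-idf = 1 * log(3/2).
print(tfidf("開発", D[0], D))
```

Multiplying by the indicator b_{t,d} is implicit here: tf(t, d) is already 0 when t ∉ d.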
10. Features (1)
1) BoW
  – Bag of Words. The full output of MeCab (a Japanese morphological analyzer); the tf-idf of each word forms an element of the feature vector x_n.
2) BoW(N)
  – Only nouns are chosen.
3) BoW(N-3)
  – Words are restricted to proper nouns, general nouns, and sahen nouns (nouns that form verbs by adding ”する” [suru], do).
4) K(14)
  – Fourteen keywords related to U-I relations: ”研究” ([kenkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discovery), ”開始” ([kaishi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relations), ”産官学” ([sankangaku], U-I-G (university-industry-government) relations), and ”連携” ([renkei], coordination).
5) K(18)
  – K(14) plus 4 keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kenkyuin], researcher).
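Restricting tf-idf to the K(14) list reduces the feature vector to fourteen elements. A sketch of the term-frequency part of that vector; the keyword list is taken from the slide, while the sample document sentence is invented for illustration.

```python
# The K(14) keyword list from the slide, in order.
KEYWORDS = ["研究", "開発", "実験", "成功", "発見", "開始", "受賞",
            "表彰", "共同", "協同", "協力", "産学", "産官学", "連携"]

def keyword_counts(text):
    """One count per keyword: the tf part of the K(14) feature vector."""
    return [text.count(k) for k in KEYWORDS]

# Illustrative press-release sentence (not from the dataset).
doc = "東京大学と企業が共同研究を開始し、産学連携の実験に成功した。"
vec = keyword_counts(doc)
print(sum(vec))  # → 7 keyword hits in this sentence
```

In the actual features each count would be weighted by the keyword's idf, as on the feature-selection slide.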
11. Features (2)
6) K(18)+NM
  – Keywords and the POS (part of speech) of the next morpheme in the text sequence are checked, so that the grammatical connections of those keywords are restricted to verbs, auxiliary verbs, and sahen nouns.
7) Corp.
  – Corporation marks: ”株式会社” ([kabushikigaisha], Incorporated), ”㈱” (the Unicode character U+3231), ”（株）” (fullwidth parentheses), or ”(株)” (halfwidth parentheses).
8) Univ.
  – A university name is checked: ”大学” ([daigaku], university) or ”大” ([dai], a shortened representation of university).
9) C.+U.
  – Both a corporation mark and a university name occur in the same sentence.
10) ORG
  – The existence of an organization, detected by CaboCha’s Japanese named entity extraction function.
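Features 7) through 9) are simple surface checks and can be sketched with regular expressions. The patterns below are a simplified reading of the markers listed above, not the authors' exact expressions, and the test sentences are invented.

```python
import re

# Corporation marks (feature 7) and university names (feature 8).
CORP = re.compile(r"株式会社|㈱|（株）|\(株\)")
UNIV = re.compile(r"大学")

def cu_feature(sentence):
    """Feature 9) C.+U.: 1 iff a corporation mark and a university
    name co-occur in the same sentence, else 0."""
    return 1 if CORP.search(sentence) and UNIV.search(sentence) else 0

print(cu_feature("東京大学とオムロン株式会社は共同研究を開始した。"))  # → 1
print(cu_feature("新製品を発売した。"))                                # → 0
```

Feature 10) ORG would replace these patterns with CaboCha's named-entity output, which also catches organizations not marked by ”株式会社” or ”大学”.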
15. Findings and discussion (1)
• In test IDs 1-1, 1-2, and 1-3, the feature elements consist of BoW, counting over 15,800, 13,000, and 12,000 elements respectively. Their F-measures are worse than those of the other features with the same linear kernel function; these configurations seem to have failed to learn.
• The likely reason for the failure is that the training data is far too small for learning. With a sufficiently large training set, the number of training samples would exceed the feature vector size, i.e. the number of basis functions of the SVM, so learning could proceed without over-fitting.
16. Findings and discussion (2)
• In test IDs 2-1 through 8-3, the feature element size is only about 14 to 33.
• Accuracy and F-measure gradually improve as the feature elements become more complex.
17. Findings and discussion (3)
• Test IDs 7-* and 8-* relate to occurrences of university and company symbols. In ID 7-3 in particular, recall and F-measure are highest, which means the co-occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.
• Scores strongly depend on the kernel function type.
• The kernel function parameters and the loss function coefficient affect the balance between precision and recall. In this experiment, the radial basis function parameter is chosen to obtain the highest F-value under cross-validation.
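The parameter selection described above, choosing the RBF kernel parameter and the loss coefficient that maximize cross-validated F-measure, amounts to a grid search. A sketch of that loop; `f_measure_cv` is a stub standing in for actually training and evaluating the SVM per fold, and its monotone shape and the grid values are invented for illustration.

```python
def f1(precision, recall):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f_measure_cv(gamma, C):
    """Stub for cross-validated F-measure of an RBF-kernel SVM.
    Pretends precision rises with C and recall with gamma."""
    precision = min(0.9, 0.5 + 0.1 * C)
    recall = min(0.9, 0.5 + 0.2 * gamma)
    return f1(precision, recall)

# Grid search: keep the (gamma, C) pair with the best F-measure.
best = max(
    ((g, C) for g in (0.5, 1.0, 2.0) for C in (1, 2, 4)),
    key=lambda gc: f_measure_cv(*gc),
)
print(best)  # parameter pair with the highest cross-validated F-measure
```

In practice `f_measure_cv` would split the training data into k folds, train on k-1 folds, and average the F-measure on the held-out folds.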
18. Conclusion and future work
• To automatically extract resources of U-I relations from the web,
  – we set the target to “press release articles” of organizations, and
  – a classification technique, the support vector machine (SVM), is applied to the decision.
• We conducted an experiment over several combinations of feature vector elements and SVM kernel function types.
• The combinations reveal that
  – U-I relations keywords, and
  – university and company symbols co-occurring in a sentence
  are effective elements for features.
• The SVM parameters are tuned for a higher F-measure, which also affects the balance between precision and recall.
• Finally, we obtain an accuracy of 80.15 and an F-measure of 81.05 for classifying U-I relations documents on the web.
• In future work, we will build the classifier into a crawler that automatically crawls organizations’ press release web sites to gather more resources.