The document discusses the development of a thesaurus of classical Japanese poetic vocabulary. It outlines how the thesaurus was created by analyzing poems from the Hachidaishu anthologies using techniques like tokenization, meta-code conversion, and matching original poems to scholarly translations to extract vocabulary terms and their meanings over time. The goal is to better understand the connotation and historical transition of classical poetic words in a longitudinal study.
The document provides an outline for Hilofumi Yamamoto's research and teaching. It summarizes his educational background, research interests, and contributions to students at Wollongong University. His research focuses on Japanese vocabulary and language teaching methods. Specific areas of research include the study of connotation and computer modeling of vocabulary using corpus linguistics techniques.
This document appears to be notes from a lecture or presentation on natural language processing and text mining techniques. It discusses topics like inverse document frequency, co-occurrence analysis, and graph-based representations of word relationships. Tables and graphs are included to illustrate co-occurrence patterns between words and how they are represented visually. The document also references various authors and their work related to semantics, meaning, and textual analysis.
1. The document discusses methods for analyzing the relationships between terms in a corpus using measures like co-occurrence weight (cw) and inverse document frequency (idf).
2. It presents formulas for calculating cw, cidf, ctf, and ictf to capture term associations based on frequency of co-occurrence.
3. Tables of term pairs are provided with their calculated measure values to demonstrate the methods. The highest scoring pairs may indicate stronger semantic relations.
1. This document presents an analysis of term weighting methods for information retrieval and text mining.
2. It examines inverse document frequency (idf), collection term frequency (ctf), and co-occurrence weight (cw) as term weighting schemes.
3. The results show that cw, which combines ctf, idf, and co-occurrence information, outperforms other term weighting methods by better representing term importance and relevance to documents.
1. The document summarizes research on analyzing the co-occurrence patterns of words in a large corpus of documents.
2. It finds that the number of high co-occurrence weight patterns between words is much smaller than the number of low co-occurrence weight patterns.
3. The document also presents examples of words that have high and low co-occurrence weights based on an analysis of a corpus of documents.
The document discusses:
1. The development of a thesaurus of classical Japanese poetic vocabulary to better understand the connotations of words over time and how their usage changed.
2. The thesaurus is being developed using materials from the Hachidaishu, eight anthologies of Japanese poetry compiled between 905 and 1205 CE.
3. The thesaurus development involves processing the poetry data through a tokenizer, code converter, and other tools to extract and categorize the vocabulary terms according to their attributes.
1. The document discusses methods for calculating weights for terms in documents, including term frequency (tf), inverse document frequency (idf), and weighted schemes that combine tf and idf like tfidf.
2. It provides examples of calculating idf values for specific terms and illustrates how idf values increase as terms appear in fewer documents.
3. Tables show ranked lists of term pairs based on their calculated co-occurrence weight (cw) values, which factor in co-occurrence frequency, idf, and co-information density.
1) This document reports on the development of MeCab, an open-source morphological analyzer for the Japanese language.
2) MeCab uses conditional random fields for part-of-speech tagging and achieved a tagging accuracy of over 99% on test data.
3) The software is open-source and freely available for download under the GPL license along with documentation and a character encoding converter.
3. Osaka Electro-Communication University, 2012
Waka (和歌): Japanese Songs
立田姫
手向くる神の / あればこそ
秋の木の葉の / 幣と散るらめ
because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.
Prince Kanemi (兼覧王, ?–932)
Kokin Wakashū, poem no. 298
7. Problem: the size of the processing unit is not fixed!
The size of the processing unit varies with the meaning in context.
• Unit → 卯の花 or 卯/の/花 (Nakano, 1998)
• Orthography → さびしい/さみしい/寂しい/淋しい (sad)
• Meaning → 卯の花 ∈ plant or 卯の花 ∈ food (unohana = a deutzia or bean curd refuse)
8. Thesaurus example: 神 (kami, god)
BG-01-2030-01-030-A-かみ-神
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
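The eight-part structure in Figure 1 can be sketched as a small parser. This is only an illustration: the class name `ThesaurusItem`, the function `parse_item`, and the field names are hypothetical labels for positions (1) to (8) and are not taken from the original tools.

```python
from typing import NamedTuple


class ThesaurusItem(NamedTuple):
    """One thesaurus item; field names are hypothetical labels for (1)-(8)."""
    database: str  # (1) database ID, e.g. BG
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID, e.g. 2030
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. 030
    era: str       # (6) era flag: A = contemporary, C = classic
    reading: str   # (7) reading of the Chinese characters
    form: str      # (8) Chinese characters


def parse_item(item: str) -> ThesaurusItem:
    """Split a hyphen-joined item such as 'BG-01-2030-01-030-A-かみ-神'."""
    # maxsplit=7 keeps any hyphen inside the last field intact.
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated fields, got {len(parts)}")
    return ThesaurusItem(*parts)


kami = parse_item("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.era, kami.form)
```

Keeping the six ID fields as strings preserves leading zeros such as `01` and `030`, which would be lost if they were converted to integers.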
21. (C) Format of classification numbers (3)
Code                   Reading      Form     Romanization   Gloss
CH-29-2130-01-010-A    たつたひめ   立田姫   Tatsutahime    Princess-Tatsuta
  CH-29-0000-14-010-A  --           立田     -- Tatsuta     Tatsuta
  BG-01-2030-01-101-A  --           姫       -- hime        princess
BG-02-3770-04-080-C    たむくる     手向く   tamukuru       present (verb)
  BG-01-5730-02-010-A  --           手       -- te          hand
  BG-02-1700-01-040-A  --           向ける   -- mukeru      for
BG-01-2030-01-030-A    かみ         神       kami           god
BG-08-0061-07-010-A    の           の       no             SUB (particle)
BG-02-1200-01-010-C    あれ         有り     are            be
BG-08-0064-26-010-A    ば           ば       ba             because (particle)
  BG-04-1120-05-150-A  --           ば       -- ba          because (reason)
BG-08-0065-01-010-A    こそ         こそ     koso           KP (emphasis)
Figure 8: Example of BG database conversion
22. [Diagram: in the 10th century, the poet writes the OP from the poet's field of experience. In the 20th century, the expert reader reads the OP and writes the CT from the expert's field of experience; the novice reader reads the CT from the novice's field of experience.]
Figure 9: Positioning of OP and CT (summary)
25. Components of OP
Table 2: Result of subtracting OP from CT
OP (valid number of elements) = 16
E (ratio of exact match)     12/16 = 0.750
F (ratio of field match)      1/16 = 0.062
G (ratio of group match)      2/16 = 0.125
T (ratio of total match)     15/16 = 0.938
U (ratio of unmatched OP)    1 - T = 0.062
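The ratios in Table 2 follow directly from the element counts. The sketch below recomputes them; the function name `op_ratios` and its parameter names are hypothetical, and only the counts (12, 1, 2 out of 16) come from the table.

```python
def op_ratios(exact: int, field: int, group: int, total: int) -> dict[str, float]:
    """Recompute the Table 2 ratios from raw match counts (hypothetical helper)."""
    e = exact / total                    # E: ratio of exact match
    f = field / total                    # F: ratio of field match
    g = group / total                    # G: ratio of group match
    t = (exact + field + group) / total  # T: ratio of total match
    u = 1 - t                            # U: ratio of unmatched OP
    return {"E": e, "F": f, "G": g, "T": t, "U": u}


ratios = op_ratios(exact=12, field=1, group=2, total=16)
for name, value in ratios.items():
    print(f"{name} = {value:.4f}")
```

The unrounded values for F and U are 1/16 = 0.0625 and for T 15/16 = 0.9375; the table shows them to three decimal places.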
27. Components of CT
Table 3: Components of CT: the contemporary Japanese translation of Kokinshū poem no. 298 by Komachiya (1982). fabs(D-H) is the absolute value of the experimental value D minus the theoretical value H.
CT (valid number of elements)     = 41
W  (ratio of original word use)   12/41 = 0.293                  (E/CT)
A  (ratio of annotation)          1-0.293 = 0.707                (1-W)
--- breakdown of the annotation ---
P1 (ratio of F+G paraphrased)     (0.62+0.12)/0.707 = 0.073      (F+G)/A
P2 (ratio of U paraphrased)       (0.707-0.073)*0.062 = 0.040    (A-P1)*U
D  (ratio of purely added)        0.707-(0.073+0.040) = 0.595    A-(P1+P2)
H  (theoretical value of D)       1-16/41 = 0.610                1-OP/CT
Gap                               fabs(0.595-0.610) = 0.015      fabs(D-H)
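The chain of values in Table 3 can be retraced step by step. The sketch below is an illustration under one stated assumption: P1 is computed as 3/41 (three field- or group-matched elements over the 41 CT elements), which reproduces the printed 0.073, although the slide labels P1 as (F+G)/A. The function name `ct_components` and its parameters are hypothetical.

```python
def ct_components(op: int, ct: int, exact: int, fg_count: int, u: float) -> dict[str, float]:
    """Retrace the Table 3 breakdown (hypothetical helper).

    Assumption: P1 = fg_count / ct, chosen because it reproduces the
    printed value 0.073; the slide itself labels P1 as (F+G)/A.
    """
    w = exact / ct        # W: ratio of original word use (E count / CT)
    a = 1 - w             # A: ratio of annotation
    p1 = fg_count / ct    # P1: F+G paraphrased (assumed reading, see above)
    p2 = (a - p1) * u     # P2: U paraphrased, (A - P1) * U
    d = a - (p1 + p2)     # D: ratio of purely added
    h = 1 - op / ct       # H: theoretical value of D
    gap = abs(d - h)      # Gap: fabs(D - H)
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h, "Gap": gap}


components = ct_components(op=16, ct=41, exact=12, fg_count=3, u=1 / 16)
for name, value in components.items():
    print(f"{name} = {value:.3f}")
```

Working from counts rather than from the already rounded ratios avoids accumulating rounding error across the chain from W to Gap.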
28. Difference: CT - OP
OP: 16 elements
Exact 12 (75.0%), Field 1 (6.2%), Group 2 (12.5%), Unmatched 1 (6.2%)
CT (298, koma): 41 elements
W 12 (29.3%), P1 3 (7.3%), P2 1 (4.0%), D 25 (59.5%)
Figure 12: Pie charts showing the components of OP and CT and their correspondence