Asialex201103slide02

Asialex 2011 Kyoto, Japan

Development of the Thesaurus of Classical
Japanese Poetic Vocabulary

Hilofumi Yamamoto
Tokyo Institute of Technology
Makiro Tanaka
National Institute of Japanese Language and Linguistics

22nd Aug. 2011


Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications


Waka: Japanese Poetry

Tatsuta-Hime..
tamukuru KAMI no / arebakoso
aki no konoha no / nusa to chirurame

because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.

Prince Kanemi
No. 298 in the Kokinshū


Problem: Orthography
in Chinese characters

in hiragana

→ All Tatsuta (place name)


Problem: Unit size / attribution
The unit size and the meaning of a word depend on the context.
• unit → or (Nakano, 1998)
• orthography →
(sad)
• attributions → ∈ plant or ∈ food
(unohana = a deutzia or bean curd refuse)


An Item of Thesaurus: God

BG-01-2030-01-030-A- -
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)

Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
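
The item structure in Figure 1 can be illustrated with a minimal Python sketch that splits a code into its first six hyphen-separated components. The names `MetaCode` and `parse_meta_code` are hypothetical and are not part of the KH or t2c tools; components (7) and (8) are ignored here.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    database: str  # (1) database ID, e.g. "BG"
    pos: str       # (2) part-of-speech ID, e.g. "01" = noun
    group: str     # (3) group ID, e.g. "2030"
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. "030" = god
    era: str       # (6) era flag: "A" = contemporary, "C" = classic


def parse_meta_code(code: str) -> MetaCode:
    """Split a hyphen-separated meta-code into its first six components."""
    parts = [p.strip() for p in code.split("-")]
    if len(parts) < 6 or not all(parts[:6]):
        raise ValueError(f"expected at least six components: {code!r}")
    return MetaCode(*parts[:6])


kami = parse_meta_code("BG-01-2030-01-030-A")
print(kami.group, kami.exact, kami.era)  # 2030 030 A
```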


Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)


Materials: the Hachidaishū
• The Hachidaishū: eight anthologies compiled by
imperial order during ca. 905–1205.
• The database: compiled by the National Institute of
Japanese Literature, Japan.
• The texts are based on the Shōhobonban version of the
Hachidaishū.

[Figure: timeline of the eight anthologies on an axis from 900 to 1250:
Kokinshū (905), Gosenshū (951), Shūishū (1007), Goshūishū (1086),
Kin'yōshū (1124), Shikashū (1144), Senzaishū (1188), Shinkokinshū (1205).
Intervals between successive anthologies: 46, 56, 79, 38, 20, 44, 17 years.]


Methods: Flowchart of data processing

[Figure: flowchart of six processing stages, labelled A to F:
corpus development, tokenisation, meta-code conversion,
mathematical modelling, subtraction (CT−OP), visualisation]


Table 1: An example of input for KH / Gosenshū No. 664
input: 000664
output:000664
( - : : : : )
( - : : : : )
( : : )
( - : : : : )
( - : : : : )
( : : )
( - : : : : )
( : : )
( : : )
( : : )
( : : )
---
( - : : )
( : : )
---
( - : : : : )
( : : )
---
( : : )
( - : : )
( - : : : : )
( - : : : : )


Development: Thesaurus

[Diagram: Poem Texts → kh (tokeniser) → t2c (code tagger) → Hachidaishu Thesaurus.
(A) add unknown entries: Dictionary
(B) add new thesaurus codes: Thesaurus (General, Place Name, Personal Name, etc.)]


(A) Corpus: Poems (OP)

KW00029800|A|KANEMI NO Ō=kanemi no ō
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/

Figure 2: Format of the database of a poem: → indicates continuing to the
next line without breaks; the first line, which includes |A|, indicates
the name of the poet; the second line, which includes |B|, indicates
the contents of the poem and added information.
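
As an illustration of the record format in Figure 2, the sketch below reads a |B| line into (surface, tag, lemma) triples. It assumes that the text in brackets is a grammatical tag optionally followed by `:` and a lemma, and that `/` and `,` are separators; this reading, the regular expression, and the name `parse_record` are assumptions made for illustration, not a description of the authors' tools.

```python
import re
from typing import List, Optional, Tuple

# surface[TAG] or surface[TAG:LEMMA]; "/" and "," act as separators.
TOKEN = re.compile(r"([^\[\]/,\s]+)\[([^\]:]+)(?::([^\]]+))?\]")

Token = Tuple[str, str, Optional[str]]


def parse_record(line: str) -> Tuple[str, str, List[Token]]:
    """Split 'ID|TYPE|body' and extract the annotated tokens from the body."""
    record_id, line_type, body = line.split("|", 2)
    tokens = [(m.group(1), m.group(2), m.group(3)) for m in TOKEN.finditer(body)]
    return record_id, line_type, tokens


# The |B| line of Figure 2 with the continuation arrows joined.
line = (
    "KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
    "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
    "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/"
    "aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/"
    "nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]"
    "rame[CJR-REAL]/"
)

record_id, line_type, tokens = parse_record(line)
for surface, tag, lemma in tokens:
    print(surface, tag, lemma)
```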


(A) Corpus: Translations (CT)
$A|000298
$B| →

$C|
$D| →

$I| →
→

Figure 3: Format of the database of a CT


(B) Tokenisation:
original text

↓
tokenising
/ / / /[ ]/ / / / / / / / / /[ ]
↓
converting into predicative form
/ / / /[ ]/ / / / / / / / / /[ ]

Figure 4: Tokenisation of poem texts


(C) meta-code conversion
CH-29-2130-01-010-A Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- -- hime princess
BG-02-3770-04-080-C tamukuru present(verb)
BG-01-5730-02-010-A -- -- te hand
BG-02-1700-01-040-A -- -- mukeru for
BG-01-2030-01-030-A kami god
BG-08-0061-07-010-A no SUB (particle)
BG-02-1200-01-010-C are be
BG-08-0064-26-010-A ba because (particle)
BG-04-1120-05-150-A -- -- ba because (reason)
BG-08-0065-01-010-A koso KP (emphasis)

Figure 5: Meta-code conversion in case of OP
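
The conversion in Figure 5 can be pictured as a dictionary lookup from token to meta-code. The sketch below is a simplification, not the t2c program itself: it holds only the top-level entries of Figure 5, maps each form to a single code (so it cannot tell the SUB and CON uses of no apart), and returns `None` for tokens that would still need a new thesaurus code. The names `THESAURUS` and `tokens_to_codes` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Top-level entries copied from Figure 5: form -> (meta-code, gloss).
THESAURUS: Dict[str, Tuple[str, str]] = {
    "Tatsutahime": ("CH-29-2130-01-010-A", "Princess-Tatsuta"),
    "tamukuru": ("BG-02-3770-04-080-C", "present (verb)"),
    "kami": ("BG-01-2030-01-030-A", "god"),
    "no": ("BG-08-0061-07-010-A", "SUB (particle)"),
    "are": ("BG-02-1200-01-010-C", "be"),
    "ba": ("BG-08-0064-26-010-A", "because (particle)"),
    "koso": ("BG-08-0065-01-010-A", "KP (emphasis)"),
}


def tokens_to_codes(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each token with its meta-code, or None if it is not in the table."""
    return [(t, THESAURUS[t][0] if t in THESAURUS else None) for t in tokens]


for token, code in tokens_to_codes(["Tatsutahime", "tamukuru", "kami", "no", "aki"]):
    print(token, code or "(unknown: needs a new thesaurus code)")
```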


(C) Structure of meta-code-1
BG-01-2030-01-030-A- -
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)

Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character


(C) Structure of the meta-code-2
BG-01-2600-01-020-A (1) = BG-01-2610-01-040-A (2)
yononaka (world) yo (world)

+ BG-08-0010-01-021-A (3)
no (of)

+ BG-01-1770-01-080-A (4)
naka (inside)

Figure 7: Structure of an item of the semantic table in the case
of a compound word, yononaka (world)
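
One possible way to hold the compound structure of Figure 7 in code is a small tree in which a compound entry lists its component entries. This is only a sketch; the class `Entry` and the function `leaf_codes` are hypothetical names and do not describe the actual layout of the semantic table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    code: str
    form: str
    gloss: str
    components: List["Entry"] = field(default_factory=list)


def leaf_codes(entry: Entry) -> List[str]:
    """Return the codes of the smallest units that make up an entry."""
    if not entry.components:
        return [entry.code]
    codes: List[str] = []
    for component in entry.components:
        codes.extend(leaf_codes(component))
    return codes


# Codes and glosses taken from Figure 7.
yononaka = Entry(
    "BG-01-2600-01-020-A", "yononaka", "world",
    [
        Entry("BG-01-2610-01-040-A", "yo", "world"),
        Entry("BG-08-0010-01-021-A", "no", "of"),
        Entry("BG-01-1770-01-080-A", "naka", "inside"),
    ],
)

print(leaf_codes(yononaka))
```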


(C) meta-code conversion-3
CH-29-2130-01-010-A Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- -- hime princess
BG-02-3770-04-080-C tamukuru present(verb)
BG-01-5730-02-010-A -- -- te hand
BG-02-1700-01-040-A -- -- mukeru for
BG-01-2030-01-030-A kami god
BG-08-0061-07-010-A no SUB (particle)
BG-02-1200-01-010-C are be
BG-08-0064-26-010-A ba because (particle)
BG-04-1120-05-150-A -- -- ba because (reason)
BG-08-0065-01-010-A koso KP (emphasis)

Figure 8: Meta-code conversion in case of OP


[Diagram: in the 10th century, the poet writes the OP within the poet's
field of experience. In the 20th century, an expert reader (field of
experience: expert) reads the OP and writes the CT; a novice reader
(field of experience: novice) reads the CT. The OP and the CT are compared.]

Figure 9: Schema of relationship between OP and CT


+-------- # of pair
| +----- value of matching level, exact=17, field=13, group=10
| | +-- # of POS
| | |
| | | # of element of OP ----+ +- # of element of CT
| | | element of OP -+ | | +--- element of CT
| | | | | | |
1 17 11 00 <-> 12 (Tatsutahime)
2 17 47 04 <-> 25 (hand)
3 17 47 05 <-> 26 (toward)
4 17 2 06 <-> 32 (god)
5 10 61 07 <-> 33 (SUB)
6 17 47 08 <-> 34 (be)
7 10 64 09 <-> 35 (because)
8 17 65 11 <-> 36 (EM)
9 17 2 12 <-> 38 (autumn)
10 17 71 13 <-> 39 (CON)
11 17 2 14 <-> 40 (leaf of tree)
12 17 2 19 <-> 45 (present)
13 17 61 20 <-> 46 (CRD)
14 17 47 21 <-> 49 (fall)
15 13 74 22 <-> 54 (CJR)

Figure 10: Example of the matching process
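
The level values in Figure 10 (exact = 17, field = 13, group = 10) coincide with the character lengths of the code prefixes BG-01-2030-01-030, BG-01-2030-01 and BG-01-2030. The sketch below relies on that observation, which is an inference and is not stated on the slides; it also assumes that the era flag is ignored when matching. The function name `match_level` and the code BG-01-2030-02-010-A in the usage lines are hypothetical.

```python
from typing import Tuple

# (level name, number of leading components that must agree)
LEVELS = (("exact", 5), ("field", 4), ("group", 3))


def match_level(op_code: str, ct_code: str) -> Tuple[str, int]:
    """Return the deepest matching level and the length of the shared prefix."""
    op_parts = op_code.split("-")
    ct_parts = ct_code.split("-")
    for name, n in LEVELS:
        if len(op_parts) >= n and op_parts[:n] == ct_parts[:n]:
            return name, len("-".join(op_parts[:n]))
    return "none", 0


print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-030-C"))  # ('exact', 17)
print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-101-A"))  # ('field', 13)
print(match_level("BG-01-2030-01-030-A", "BG-01-2030-02-010-A"))  # ('group', 10)
```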


Residual

CT ( ) ( )
OP — —— — — — — — — — — — — — —— —

CT ( ) ( ) ( ) ( )
OP — — [ ] — — — — — —

Figure 11: Example of the matching process in the case of kks 298 in
Komachiya (1982)


Components of OP
Table 2: Result of subtracting the elements of OP(298) from those
of CT(298, koma), showing the proportion of each component
of OP(298).
OP (valid number of elements) = 16
E (ratio of exact match) 12/16 = 0.750
F (ratio of field match) 1/16 = 0.062
G (ratio of group match) 2/16 = 0.125
T (ratio of total match) 15/16 = 0.938
U (ratio of unmatched OP) 1 - T = 0.062
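
The ratios in Table 2 can be recomputed from the fifteen matching levels listed in Figure 10 and the OP size of 16. The following is a small sketch of that arithmetic; the function name `op_ratios` is hypothetical.

```python
from collections import Counter
from typing import Dict, List

LEVEL_NAMES = {17: "exact", 13: "field", 10: "group"}


def op_ratios(levels: List[int], op_size: int) -> Dict[str, float]:
    """Compute E, F, G, T and U from the matching levels of the matched pairs."""
    counts = Counter(LEVEL_NAMES[level] for level in levels)
    total = len(levels) / op_size
    return {
        "E": counts["exact"] / op_size,
        "F": counts["field"] / op_size,
        "G": counts["group"] / op_size,
        "T": total,
        "U": 1 - total,
    }


# Matching levels of pairs 1 to 15 in Figure 10.
levels = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
ratios = op_ratios(levels, op_size=16)
for name, value in ratios.items():
    print(f"{name} = {value:.4f}")
```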


Calculation of Residual Rate

\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}


Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
fabs(D-H) stands for the absolute value of the practical value, D,
minus the theoretical value, H.

CT (valid number of elements)                  41
W  (ratio of original word use)   E/CT         12/41 = 0.293
A  (ratio of annotation)          1-W          1-0.293 = 0.707
---breakdown of the annotation---
P1 (ratio of FG paraphrased)      (F+G)/CT     (1+2)/41 = 0.073
P2 (ratio of U paraphrased)       (A-P1)*U     (0.707-0.073)*0.062 = 0.040
D  (ratio of purely added)        A-(P1+P2)    0.707-(0.073+0.040) = 0.595
H  (theoretical value of D)       1-OP/CT      1-16/41 = 0.610
Gap                               fabs(D-H)    fabs(0.595-0.610) = 0.015
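
The breakdown of the CT can be reproduced from five counts: the OP and CT sizes (16 and 41) and the numbers of exact, field and group matches (12, 1 and 2). The sketch below takes P1 as the number of field and group matches divided by the CT size, an assumption chosen because it yields the 0.073 reported for P1; the function name `ct_components` is hypothetical.

```python
from typing import Dict


def ct_components(op_size: int, ct_size: int,
                  exact: int, field: int, group: int) -> Dict[str, float]:
    """Recompute W, A, P1, P2, D, H and the gap |D - H| from raw counts."""
    u = 1 - (exact + field + group) / op_size  # ratio of unmatched OP elements
    w = exact / ct_size                        # original words reused in the CT
    a = 1 - w                                  # annotation: everything else
    p1 = (field + group) / ct_size             # assumed reading of P1
    p2 = (a - p1) * u
    d = a - (p1 + p2)                          # purely added material
    h = 1 - op_size / ct_size                  # theoretical value of D
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h, "Gap": abs(d - h)}


components = ct_components(op_size=16, ct_size=41, exact=12, field=1, group=2)
for name, value in components.items():
    print(f"{name:3s} = {value:.3f}")
```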


Subtraction: CT - OP

OP(298): 16 elements
  Exact      12 (75.0%)
  Group       2 (12.5%)
  Field       1 (6.2%)
  Unmatched   1 (6.2%)

CT(298, koma): 41 elements
  W   12 (29.3%)
  P1   3 (7.3%)
  P2   1 (4.0%)
  D   25 (59.5%)

Figure 12: Pie charts illustrating the components of OP(298) and
CT(298, koma)


(E) Mathematical modelling
\[
cw(t_1, t_2) = \bigl(1 + \log ctf(t_1, t_2)\bigr)\,\sqrt{idf(t_1)\, idf(t_2)} \tag{4}
\]

\[
idf(t) = \log \frac{N}{df(t)} \tag{5}
\]
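
Equations (4) and (5) can be tried on a toy corpus. The sketch below assumes that N is the number of texts, df(t) the number of texts containing t, and ctf(t1, t2) the number of texts in which both words occur; it uses the natural logarithm, since the base is not given on the slide. The four toy texts and all function names are invented for illustration and are not data from the Hachidaishū.

```python
import math
from typing import List, Set


def idf(term: str, docs: List[Set[str]]) -> float:
    """Equation (5): log(N / df(t)); the term must occur in at least one text."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def cw(t1: str, t2: str, docs: List[Set[str]]) -> float:
    """Equation (4): co-occurrence weight of a word pair."""
    ctf = sum(1 for doc in docs if t1 in doc and t2 in doc)
    if ctf == 0:
        return 0.0  # the pair never co-occurs, so no edge is drawn
    return (1 + math.log(ctf)) * math.sqrt(idf(t1, docs) * idf(t2, docs))


# Invented toy texts, each reduced to a set of tokens.
docs = [
    {"cuckoo", "mountain", "cry"},
    {"cuckoo", "cry", "voice"},
    {"warbler", "plum", "cry"},
    {"warbler", "spring", "flower"},
]

print(cw("cuckoo", "cry", docs))
print(cw("warbler", "plum", docs))
```

A word that occurs in every text has idf = 0, so every pair containing it receives a weight of zero under this formula.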

[Figure: network graphs of co-occurring words with numeric edge labels.
Node labels include warbler, cuckoo, mountain cuckoo, treetop, voice,
singing voice, cry.vi, sing.vi, plum, flower, branch, spring, spring haze,
summer, summer mountains, midsummer rain, May, Tatsuta.PN and Otowa.PN.
Panel labels: warbler-CT-23-229-3.73-15, cuckoo-CT-40-370-3.27-16]


Conclusion
The thesaurus annotated with meta-codes allows researchers

1. to identify different orthographies as the same word;

2. to attach alternative semantic IDs to a word that has a single
form but more than one meaning (polysemous word);

3. to attach meta-codes not only to tokens recognised as a
single, simple word but also to longer tokens;

4. to indicate the similarity between tokens;

5. to detect common or differing tokens across two or more texts,
which reveals the similarities and differences between the texts;

6. to indicate the relative differences between two words in literary
works.


Questions
• Computer Modelling of Classical Japanese Poetic
Vocabulary
http://etymology.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
yamagen@ryu.titech.ac.jp
• Thank you.
