Tokyotech20130715

Tokyo Tech Linguistics Round Table Seminar, 15 July 2013.
We describe the development of a thesaurus of classical Japanese poetic vocabulary using mathematical modelling.

Transcript

  • 1. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 1 Development of the Thesaurus of Classical Japanese Poetic Vocabulary Hilofumi Yamamoto Tokyo Institute of Technology 15th July 2013
  • 2. Outline
    1. Purpose of Study
       • Connotation of classical poetic vocabulary
       • Longitudinal study of transition of vocabulary
    2. Development of Thesaurus
    3. Applications
  • 3. Waka: Japanese Poetry
    Tatsuta-Hime / tamukuru KAMI no / arebakoso / aki no konoha no / nusa to chirurame
    Because Princess Tatsuta has a god to whom she offers brocades, the leaves of trees in autumn will scatter as an offering.
    Prince Kanemi, No. 298 in the Kokinshū
  • 4. Problem: Orthography
    in hiragana: たつた
    in Chinese characters: 立田, 竜田, 龍田
    → all Tatsuta (place name)
  • 5. Problem: Unit size / attribution
    The unit size and meaning of a word depend on context.
    • unit → 卯の花 or 卯/の/花 (Nakano, 1998)
    • orthography → さびしい/さみしい/寂しい/淋しい (sad)
    • attribution → 卯の花 ∈ plant or 卯の花 ∈ food (unohana = a deutzia or bean curd refuse)
  • 6. An Item of Thesaurus: God
    BG-01-2030-01-030-A-かみ-神
    (1)-(2)-(3)-(4)-(5)-(6)-(7)-(8)
    Figure 1: Structure of an item of the BG database in the case of kami (god): (1) database ID (BG = short-unit general vocabulary); (2) part-of-speech ID (01 = noun); (3) group ID (2030 = Shinto deities and Buddhas); (4) field ID; (5) exact ID (030 = god); (6) era flag (A = contemporary, C = classic); (7) Chinese character reading; (8) Chinese character
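The eight-part item layout described in Figure 1 can be split mechanically. The following is a minimal sketch, assuming the parts are always separated by hyphens and that only the last part may itself contain a hyphen; the names `ThesaurusItem` and `parse_item` are hypothetical and not part of the original tools.

```python
from typing import NamedTuple


class ThesaurusItem(NamedTuple):
    """One thesaurus item, following the numbering (1)-(8) of Figure 1."""
    database: str  # (1) database ID, e.g. "BG"
    pos: str       # (2) part-of-speech ID, e.g. "01" = noun
    group: str     # (3) group ID, e.g. "2030"
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. "030"
    era: str       # (6) era flag, "A" or "C"
    reading: str   # (7) reading of the Chinese character
    surface: str   # (8) Chinese character


def parse_item(item: str) -> ThesaurusItem:
    """Split an item string into its eight parts (assumed hyphen-separated)."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts, got {len(parts)}")
    return ThesaurusItem(*parts)


kami = parse_item("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.era, kami.surface)
```

Keeping the parts as strings preserves leading zeros such as `01` and `030`, which matter when codes are compared by prefix.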
  • 7. Development: Thesaurus, KH, and t2c
    • Thesaurus for classical poetic vocabulary
    • KH (tokenizer)
    • t2c (token to code converter)
  • 8. Materials: the Hachidaishū
    • The Hachidaishū: eight anthologies compiled by imperial order during ca. 905–1205.
    • The database: compiled by the National Institute of Japanese Literature, Japan.
    • Old texts are taken from the Shōhobonban version of the Hachidaishū.
    Timeline (anthology, year, years until the next anthology):
    Kokinshū        (•905)    46
    Gosenshū        (•951)    56
    Shūishū         (•1007)   79
    Goshūishū       (1086)    38
    Kinyōshū        (•1124)   20
    Shikashū        (•1144)   44
    Senzaishū       (1188)    17
    Shinkokinshū    (1205)
  • 9. Methods: Flowchart of data processing
    A. Corpus development
    B. Tokenisation
    C. Meta-code conversion
    D. Mathematical modelling
    E. Subtraction: CT − OP
    F. Visualisation
  • 10. Development: Thesaurus, KH, and t2c
    • Thesaurus for classical poetic vocabulary
    • KH (tokenizer)
    • t2c (token to code converter)
  • 11. Table 1: An example of input and output for KH / Gosenshū No. 664
    input: 000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん
    output: 000664
    わすら (ラ四-未:忘る:わする:忘ら:わすら)
    れ (自可受-用:る:る:れ:れ)
    て (接助:て:て)
    思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
    なげき (カ四-用:嘆く:なげく:嘆き:なげき)
    の (格助:の:の)
    しげる (ラ四-終体:茂る:しげる:茂る:しげる)
    を (*助:を:を)
    や (係助:や:や)
    身 (名:身:み)
    を (*助:を:を)
    ---
    はづかし (名-地名:羽束師:はづかし)
    の (格助:の:の)
    ---
    はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
    の (格助:の:の)
    ---
    もり (名:森:もり)
    と (格助-引用:と:と)
    いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
    らん (推-終体:らむ:らむ:らむ:らむ)
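The KH output in Table 1 pairs each surface token with a parenthesised, colon-separated annotation. As a rough sketch of how such output could be read back into a program, the code below assumes that the first colon-separated field is the grammatical tag and leaves the remaining fields uninterpreted; `parse_kh_output` is a hypothetical helper, not part of KH.

```python
import re

# A surface form followed by an annotation in ASCII parentheses.
TOKEN_RE = re.compile(r"(\S+)\s*\(([^)]*)\)")


def parse_kh_output(text):
    """Return one dict per token: surface form, tag, and remaining fields."""
    tokens = []
    for surface, annotation in TOKEN_RE.findall(text):
        tag, *fields = annotation.split(":")
        tokens.append({"surface": surface, "tag": tag, "fields": fields})
    return tokens


sample = (
    "わすら (ラ四-未:忘る:わする:忘ら:わすら) "
    "れ (自可受-用:る:る:れ:れ) "
    "て (接助:て:て)"
)
tokens = parse_kh_output(sample)
for token in tokens:
    print(token["surface"], token["tag"], token["fields"])
```

The `---` separators that mark alternative analyses (as for はづかし) are skipped by this pattern; handling them as alternatives would need an extra step.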
  • 12. Development: Thesaurus
    [Diagram: poem texts (Hachidaishu) pass through kh (tokeniser) and then t2c (thesaurus code tagger). Two feedback loops, labelled (A) and (B), add unknown entries to the dictionary (general, place name, personal name, etc.) and add new thesaurus codes to the thesaurus.]
  • 13. (A) Corpus: Poems (OP)
    KW00029800|A|KANEMI NO Ō=kanemi no ō
    KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
    tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
    no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
    aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
    nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
    rame[CJR-REAL]/
    Figure 2: Format of the database of a poem: → indicates that the record continues on the next line without a break; the first line, which includes |A|, gives the name of the poet; the second line, which includes |B|, gives the text of the poem and the added information.
  • 14. (A) Corpus: Translations (CT)
    $A|000298
    $B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
    をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
    $C|秋の歌
    $D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
    をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
    $I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう →
    のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ →
    となってちっているのだろう。
    Figure 3: Format of the database of a CT
  • 15. (B) Tokenisation
    original text: 立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ
    ↓ tokenising
    立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]
    ↓ converting into predicative form
    立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]
    Figure 4: Tokenisation of poem texts
  • 16. (C) meta-code conversion
    CH-29-2130-01-010-A   たつたひめ   立田姫   Tatsutahime   Princess-Tatsuta
    CH-29-0000-14-010-A   -- 立田 --   Tatsuta   Tatsuta
    BG-01-2030-01-101-A   -- 姫 --   hime   princess
    BG-02-3770-04-080-C   たむくる   手向く   tamukuru   present (verb)
    BG-01-5730-02-010-A   -- 手 --   te   hand
    BG-02-1700-01-040-A   -- 向ける --   mukeru   for
    BG-01-2030-01-030-A   かみ   神   kami   god
    BG-08-0061-07-010-A   の   の   no   SUB (particle)
    BG-02-1200-01-010-C   あれ   有り   are   be
    BG-08-0064-26-010-A   ば   ば   ba   because (particle)
    BG-04-1120-05-150-A   -- ば --   ba   because (reason)
    BG-08-0065-01-010-A   こそ   こそ   koso   KP (emphasis)
    Figure 5: Meta-code conversion in the case of OP
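The conversion step shown in Figure 5 amounts to looking up each token in the thesaurus and attaching its code. The sketch below illustrates that idea only: the table `THESAURUS` is a hypothetical in-memory dictionary filled with rows from Figure 5, `tokens_to_codes` is an invented name, and the real t2c presumably also resolves ambiguous tokens (for example の as SUB or CON), which a context-free lookup cannot do.

```python
# Hypothetical lookup table built from rows of Figure 5 (surface form -> code).
THESAURUS = {
    "立田姫": "CH-29-2130-01-010-A",
    "手向く": "BG-02-3770-04-080-C",
    "神": "BG-01-2030-01-030-A",
    "の": "BG-08-0061-07-010-A",
    "有り": "BG-02-1200-01-010-C",
    "ば": "BG-08-0064-26-010-A",
    "こそ": "BG-08-0065-01-010-A",
}


def tokens_to_codes(tokens):
    """Pair each token with its code; unknown tokens are paired with None."""
    return [(token, THESAURUS.get(token)) for token in tokens]


pairs = tokens_to_codes(["立田姫", "神", "の", "有り", "ば", "こそ", "秋"])
# Tokens without a code are candidates for new thesaurus entries.
unknown = [token for token, code in pairs if code is None]
print(pairs)
print(unknown)
```

Returning `None` for unknown tokens, rather than raising an error, mirrors the workflow in which missing entries are collected and added to the thesaurus later.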
  • 17. (C) Structure of meta-code-1
    BG-01-2030-01-030-A-かみ-神
    (1)-(2)-(3)-(4)-(5)-(6)-(7)-(8)
    Figure 6: Structure of an item of the BG database in the case of kami (god): (1) database ID (BG = short-unit general vocabulary); (2) part-of-speech ID (01 = noun); (3) group ID (2030 = Shinto deities and Buddhas); (4) field ID; (5) exact ID (030 = god); (6) era flag (A = contemporary, C = classic); (7) Chinese character reading; (8) Chinese character
  • 18. (C) Structure of the meta-code-2
    BG-01-2600-01-020-A yononaka (world)    (1)
    = BG-01-2610-01-040-A yo (world)        (2)
    + BG-08-0010-01-021-A no (of)           (3)
    + BG-01-1770-01-080-A naka (inside)     (4)
    Figure 7: Structure of an item of the semantic table in the case of a compound word, yononaka (world)
  • 19. (C) meta-code conversion-3
    CH-29-2130-01-010-A   たつたひめ   立田姫   Tatsutahime   Princess-Tatsuta
    CH-29-0000-14-010-A   -- 立田 --   Tatsuta   Tatsuta
    BG-01-2030-01-101-A   -- 姫 --   hime   princess
    BG-02-3770-04-080-C   たむくる   手向く   tamukuru   present (verb)
    BG-01-5730-02-010-A   -- 手 --   te   hand
    BG-02-1700-01-040-A   -- 向ける --   mukeru   for
    BG-01-2030-01-030-A   かみ   神   kami   god
    BG-08-0061-07-010-A   の   の   no   SUB (particle)
    BG-02-1200-01-010-C   あれ   有り   are   be
    BG-08-0064-26-010-A   ば   ば   ba   because (particle)
    BG-04-1120-05-150-A   -- ば --   ba   because (reason)
    BG-08-0065-01-010-A   こそ   こそ   koso   KP (emphasis)
    Figure 8: Meta-code conversion in the case of OP
  • 20. [Diagram: a poet (10th-century field of experience) writes the OP; an expert reader (20th-century field of experience, expert) reads the OP and writes the CT; a novice reader (20th-century field of experience, novice) reads the CT; OP and CT are compared.]
    Figure 9: Schema of the relationship between OP and CT
  • 21. Matching of OP and CT elements (value of matching level: exact = 17, field = 13, group = 10)
    pair  level  POS  OP element  OP no.  <->  CT no.  CT element  (gloss)
    1     17     11   立田姫      00      <->  12      龍田姫      (Tatsutahime)
    2     17     47   手          04      <->  25      手          (hand)
    3     17     47   向ける      05      <->  26      向ける      (toward)
    4     17     2    神          06      <->  32      神          (god)
    5     10     61   の          07      <->  33      が          (SUB)
    6     17     47   有り        08      <->  34      ある        (be)
    7     10     64   ば          09      <->  35      から        (because)
    8     17     65   こそ        11      <->  36      こそ        (EM)
    9     17     2    秋          12      <->  38      秋          (autumn)
    10    17     71   の          13      <->  39      の          (CON)
    11    17     2    木の葉      14      <->  40      木の葉      (leaf of tree)
    12    17     2    幣          19      <->  45      幣          (present)
    13    17     61   と          20      <->  46      と          (CRD)
    14    17     47   散る        21      <->  49      散る        (fall)
    15    13     74   らむ        22      <->  54      う          (CJR)
    Figure 10: Example of the matching process
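The level values 17, 13 and 10 coincide with the character lengths of the code prefixes up to the exact ID (`BG-01-2030-01-030`), the field ID (`BG-01-2030-01`) and the group ID (`BG-01-2030`). The sketch below assumes, on that basis alone, that a matching level can be computed as the longest shared prefix of two codes; whether the original tool works this way is not stated in the slides. The function name and the code marked as hypothetical are invented for illustration.

```python
def matching_level(code_a: str, code_b: str) -> int:
    """Return 17 (exact), 13 (field), 10 (group), or 0 when no level matches."""
    for level in (17, 13, 10):
        if code_a[:level] == code_b[:level]:
            return level
    return 0


KAMI = "BG-01-2030-01-030-A"           # god (from Figure 5)
HIME = "BG-01-2030-01-101-A"           # princess (from Figure 5)
SAME_GROUP = "BG-01-2030-02-010-A"     # hypothetical code, same group only
NO_PARTICLE = "BG-08-0061-07-010-A"    # particle no (from Figure 5)

print(matching_level(KAMI, KAMI))         # identical codes
print(matching_level(KAMI, HIME))         # same group and field
print(matching_level(KAMI, SAME_GROUP))   # same group, different field
print(matching_level(KAMI, NO_PARTICLE))  # different part of speech
```

Checking the longest prefix first means that a pair is always reported at its most specific level.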
  • 22. Residual
    CT: (秋の末近くなって帰り道についた)龍田姫(が道中の無事を願って)手向け
    OP: - 立田姫 - 手向ける
    CT: (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ)う
    OP: - 神のあれ ば こそ秋の木の葉[の]幣 と - 散る - らめ
    Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
  • 23. Components of OP
    Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma), showing the ratio of each component of OP(298).
    OP (valid number of elements) = 16
    E (ratio of exact match)      12/16 = 0.750
    F (ratio of field match)       1/16 = 0.062
    G (ratio of group match)       2/16 = 0.125
    T (ratio of total match)      15/16 = 0.938
    U (ratio of unmatched OP)     1 - T = 0.062
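The ratios in Table 2 follow directly from counting the matching levels of the 15 pairs listed in Figure 10 and dividing by the 16 valid OP elements. A minimal sketch of that arithmetic, with the hypothetical helper name `op_components`:

```python
from collections import Counter

# Matching levels of pairs 1-15 as listed in Figure 10.
levels = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
op_size = 16  # valid number of OP elements (Table 2)


def op_components(levels, op_size):
    """Ratios of exact (17), field (13) and group (10) matches over OP."""
    counts = Counter(levels)
    e = counts[17] / op_size
    f = counts[13] / op_size
    g = counts[10] / op_size
    t = e + f + g
    return {"E": e, "F": f, "G": g, "T": t, "U": 1 - t}


ratios = op_components(levels, op_size)
for name, value in ratios.items():
    print(name, value)
```

The unrounded values are 12/16, 1/16, 2/16, 15/16 and 1/16; the slide reports them to three decimal places.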
  • 24. Calculation of Residual Rate
    D = 1 - \frac{OP}{CT}      (1)
      = 1 - \frac{16}{41}      (2)
      = 0.61                   (3)
    where OP = 16 and CT = 41 are the valid numbers of elements of OP(298) and CT(298, koma).
  • 25. Components of CT
    Table 3: Components of CT in the case of kks 298 by Komachiya (1982); fabs(D-H) is the absolute value of the practical value, D, minus the theoretical value, H.
    CT (valid number of elements) = 41
    W  (ratio of original word use)   12/41 = 0.293                      E/CT
    A  (ratio of annotation)          1 - 0.293 = 0.707                  1 - W
    --- breakdown of the annotation ---
    P1 (ratio of FG paraphrased)      3/41 = 0.073                       (F + G)/CT
    P2 (ratio of U paraphrased)       (0.707 - 0.073) * 0.062 = 0.040    (A - P1) * U
    D  (ratio of purely added)        0.707 - (0.073 + 0.040) = 0.595    A - (P1 + P2)
    H  (theoretical value of D)       1 - 16/41 = 0.610                  1 - OP/CT
    Gap                               fabs(0.595 - 0.610) = 0.015        fabs(D - H)
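The breakdown in Table 3 can be recomputed from five numbers: the counts of exact, field and group matches (12, 1, 2) and the sizes of OP and CT (16, 41). The sketch below is one reading of the table, not the author's program: in particular it assumes that P1 is the number of field and group matches divided by CT (3/41), which is the reading that reproduces the reported 0.073 and the "P1 3 (7.3%)" of Figure 12. The function name `ct_components` is hypothetical.

```python
def ct_components(e, f, g, op, ct):
    """e, f, g: counts of exact, field, group matches; op, ct: element counts."""
    u = 1 - (e + f + g) / op   # ratio of unmatched OP elements (Table 2)
    w = e / ct                 # ratio of original word use
    a = 1 - w                  # ratio of annotation
    p1 = (f + g) / ct          # ratio of F/G paraphrased (assumed reading)
    p2 = (a - p1) * u          # ratio of U paraphrased
    d = a - (p1 + p2)          # ratio of purely added elements
    h = 1 - op / ct            # theoretical value of D
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h,
            "Gap": abs(d - h)}


result = ct_components(e=12, f=1, g=2, op=16, ct=41)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Working from the unrounded counts avoids the small drift that appears when the already rounded values 0.707, 0.073 and 0.062 are combined by hand.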
  • 26. Subtraction: CT - OP
    OP(298), 16 elements: Exact 12 (75.0%), Field 1 (6.2%), Group 2 (12.5%), Unmatched 1 (6.2%)
    CT(298, koma), 41 elements: W 12 (29.3%), P1 3 (7.3%), P2 1 (4.0%), D 25 (59.5%)
    Figure 12: Pie charts illustrating the components of OP(298) and CT(298, koma)
  • 27. (E) Mathematical modelling
    cw(t_1, t_2) = (1 + \log ctf(t_1, t_2)) \sqrt{idf(t_1) \, idf(t_2)}    (4)
    idf(t) = \log \frac{N}{df(t)}    (5)
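Equations (4) and (5) can be tried out on a toy corpus. The slide does not define the symbols, so the sketch below makes explicit assumptions: N is the number of poems, df(t) is the number of poems containing token t, ctf(t1, t2) is the number of poems containing both tokens, and log is the natural logarithm. The four "poems" are invented sets of English glosses, not data from the Hachidaishū.

```python
import math


def idf(term, docs):
    """Equation (5): log(N / df(t)); assumes N = number of poems."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        raise ValueError(f"term not found in any document: {term!r}")
    return math.log(len(docs) / df)


def cw(t1, t2, docs):
    """Equation (4); ctf is assumed to count poems containing both terms."""
    ctf = sum(1 for doc in docs if t1 in doc and t2 in doc)
    if ctf == 0:
        return 0.0  # the terms never co-occur, so no weight is assigned
    return (1 + math.log(ctf)) * math.sqrt(idf(t1, docs) * idf(t2, docs))


# Invented toy corpus: each poem is reduced to a set of tokens.
docs = [
    {"warbler", "spring", "flower"},
    {"warbler", "plum", "spring"},
    {"cuckoo", "summer", "mountain"},
    {"cuckoo", "voice"},
]

weight = cw("warbler", "spring", docs)
print(weight)
print(cw("warbler", "cuckoo", docs))
```

The weight is symmetric in its two arguments, and a token that occurs in every poem contributes an idf of zero, which removes uninformative pairs.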
  • 28. [Figure: two graphs with panel titles warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16; node labels include warbler, spring, flower, plum, spring haze, cuckoo, summer, voice, mountain and midsummer rain, each with numeric values.]
  • 29. Conclusion
    The thesaurus annotated with meta-codes allows researchers
    1. to identify different orthographies as the same word;
    2. to attach alternative semantic IDs to a word which has the same form but more than one meaning (polysemous word);
    3. to attach meta-codes not only to tokens recognised as single, simple words but also to longer tokens;
    4. to indicate the similarity between tokens;
    5. to detect common or different tokens across two or more texts, which reveals the similarities or differences between the texts;
    6. to indicate the relative differences between two words in literary works.
  • 30. Questions
    • Computer Modelling of Classical Japanese Poetic Vocabulary: http://warbler.ryu.titech.ac.jp/waka/poem.cgi
    • Inquiry: Hilofumi Yamamoto, yamagen@ryu.titech.ac.jp
    • Thank you.