Asialex 2011 Kyoto, Japan                                          1



       Development of the Thesaurus of Classical
             Japanese Poetic Vocabulary




                                Hilofumi Yamamoto
                            Tokyo Institute of Technology
                                   Makiro Tanaka
         National Institute of Japanese Language and Linguistics

                                  22nd Aug. 2011




       Outline
         1. Purpose of Study
              • Connotation of classical poetic vocabulary
              • Longitudinal study of transition of vocabulary
         2. Development of Thesaurus
         3. Applications




       Waka: Japanese Poetry




                            Tatsuta-Hime..
                            tamukuru KAMI no / arebakoso
                            aki no konoha no / nusa to chirurame

                            because Princess Tatsuta
                            has a god to whom she offers brocades,
                            the leaves of trees
                            in autumn will scatter
                            as an offering.

                                                 Prince Kanemi
                                                 No. 298 in the Kokinshū




       Problem: Orthography
                                in Chinese characters

                  in hiragana




                                → All Tatsuta (place name)




       Problem: Unit size / attribution
       The unit size and meaning of a word depend on its context.
         • unit →           or          (Nakano, 1998)
         • orthography →
           (sad)
         • attributions →         ∈ plant or       ∈ food
            (unohana = a deutzia or bean curd refuse)


       An Item of Thesaurus: God

                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 1: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character
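
The eight-part structure in Figure 1 can be illustrated with a small parser. This is a minimal sketch, assuming the parts are hyphen-separated and appear in the order (1) to (8); the class name, field names, and `parse_item` are hypothetical and are not taken from the authors' tools. Parts (7) and (8) are treated as optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThesaurusItem:
    """One thesaurus item, following parts (1)-(8) of Figure 1 (names are illustrative)."""
    database: str        # (1) database ID, e.g. "BG"
    pos: str             # (2) part of speech ID, e.g. "01" = noun
    group_id: str        # (3) group ID, e.g. "2030"
    field_id: str        # (4) field ID
    exact_id: str        # (5) exact ID, e.g. "030" = god
    era: str             # (6) era flag: "A" or "C"
    reading: str = ""    # (7) Chinese character reading (optional here)
    character: str = ""  # (8) Chinese character (optional here)


def parse_item(code: str) -> ThesaurusItem:
    """Split a hyphen-separated item into its parts; missing trailing parts become ''."""
    parts = code.split("-", 7)
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 hyphen-separated parts: {code!r}")
    parts += [""] * (8 - len(parts))
    return ThesaurusItem(*parts)


item = parse_item("BG-01-2030-01-030-A")
print(item.database, item.pos, item.group_id, item.field_id, item.exact_id, item.era)
```

Keeping the parts as strings preserves leading zeros such as "030".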




       Development: Thesaurus, KH, and t2c
         • Thesaurus for classical poetic vocabulary
         • KH (tokenizer)
         • t2c (token to code converter)



        Materials: the Hachidaishū
           • The Hachidaishū: eight anthologies compiled by
             imperial order during ca. 905–1205.
           • The database: compiled by the National Institute of
             Japanese Literature, Japan.
           • Texts are based on the Shōhobonban version of the
             Hachidaishū.

           Timeline of the eight anthologies
           (year of compilation; interval in years to the next anthology):
             Kokinshū        905    46
             Gosenshū        951    56
             Shūishū        1007    79
             Goshūishū      1086    38
             Kin'yōshū      1124    20
             Shikashū       1144    44
             Senzaishū      1188    17
             Shinkokinshū   1205




       Methods: Flowchart of data processing



          A  Corpus development
          B  Tokenisation
          C  Meta-code conversion
          D  Mathematical modelling
          E  Subtraction: CT − OP
          F  Visualisation

                  Table 1: An example of input for KH / Gosenshū No. 664
         input:  000664
         output: 000664
                           (       - :   :   :   :              )
                   (            - : : : : )
                   (        :    : )
                       (        -    :   :   :   :              )
                           (       - :   :   :   :              )
                   (        :    : )
                           (       -   :   :   :   :                )
                  (         :    : )
                  (         :    : )
                  ( : :           )
                  (   :          : )
                ---
                        (        -       :     :       )
                  (   : :            )
                ---
                        (                - :       :        :           :   )
                  (   : :            )
                ---
                    ( : :               )
                  (   -              : : )
                    (    -             :    :   :   :   )
                    ( -              :    :   :   :   )




       Development: Thesaurus

         Poem texts → kh (tokeniser) → t2c (thesaurus code tagger) → Hachidaishū Thesaurus

           (A) kh:  Dictionary (add unknown entries)
           (B) t2c: General, Place Name, Personal Name, etc. (add new thesaurus codes)




       (A) Corpus: Poems (OP)

             KW00029800|A|KANEMI NO Ō=kanemi no ō
             KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
                        tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
                        no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
                        aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
                        nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
                        rame[CJR-REAL]/

          Figure 2: Format of the database of a poem: → indicates continuing to the
                    next line without breaks; the first line, which includes |A|, indicates
                    the name of the poet; the second line which includes |B|, indicates
                    the contents of the poem and added information.
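
The |B| line in Figure 2 can be read mechanically. The sketch below assumes each token has the shape `surface[TAG]` or `surface[TAG:LEMMA]`, and that `/` and `,` are separators between tokens; the meaning of the separators is not stated on the slide. `TOKEN`, `SAMPLE`, and `parse_poem_line` are hypothetical names, not part of kh or t2c.

```python
import re

# surface[TAG] or surface[TAG:LEMMA]; '/' and ',' are treated as separators (assumption).
TOKEN = re.compile(r"([^\[\]/,]+)\[([^\]:]+)(?::([^\]]+))?\]")

# The |B| record of Figure 2, joined into one line (the arrows mark line continuations).
SAMPLE = (
    "KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
    "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
    "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/"
    "aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/"
    "nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]"
    "rame[CJR-REAL]/"
)


def parse_poem_line(line):
    """Return (poem_id, tokens); each token is (surface, tag, lemma or None)."""
    poem_id, kind, body = line.split("|", 2)
    if kind != "B":
        raise ValueError("expected a |B| record")
    tokens = [(m.group(1), m.group(2), m.group(3)) for m in TOKEN.finditer(body)]
    return poem_id, tokens


poem_id, tokens = parse_poem_line(SAMPLE)
for surface, tag, lemma in tokens:
    print(surface, tag, lemma)
```

Tokens without a lemma part, such as `no[SUB]`, come back with `None` in the third position.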




       (A) Corpus: Translations (CT)
           $A|000298
           $B|                                                         →

           $C|
           $D|                                                         →

           $I|                                                         →
                                                                        →


                            Figure 3: Format of the database of a CT




       (B) Tokenisation:
            original text


               ↓
            tokenising
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]
               ↓
            converting into predicative form
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]

                             Figure 4: Tokenisation of poem texts




       (C) meta-code conversion
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 5: Meta-code conversion in the case of OP



       (C) Structure of meta-code-1
                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 6: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character




       (C) Structure of the meta-code-2
             BG-01-2600-01-020-A (1)     =   BG-01-2610-01-040-A (2)
             yononaka (world)                yo (world)


                                         +   BG-08-0010-01-021-A (3)
                                             no (of)


                                         +   BG-01-1770-01-080-A (4)
                                             naka (inside)



          Figure 7: Structure of an item of the semantic table in the case
                    of a compound word, yononaka (world)
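
The decomposition in Figure 7 can be sketched as a lookup table from a compound's code to the codes of its parts. This is only an illustration: the table layout and the names `COMPONENTS` and `expand` are assumptions, and only the four codes shown in the figure are used.

```python
# Compound code -> component codes, taken from Figure 7 (table layout is an assumption).
COMPONENTS = {
    "BG-01-2600-01-020-A": [       # yononaka (world)
        "BG-01-2610-01-040-A",     # yo (world)
        "BG-08-0010-01-021-A",     # no (of)
        "BG-01-1770-01-080-A",     # naka (inside)
    ],
}


def expand(code, table=COMPONENTS):
    """Return the code followed by the codes of its components, recursively."""
    result = [code]
    for part in table.get(code, []):
        result.extend(expand(part, table))
    return result


for code in expand("BG-01-2600-01-020-A"):
    print(code)
```

A simple word that has no entry in the table expands to itself.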




       (C) meta-code conversion-3
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 8: Meta-code conversion in the case of OP




          10th century, field of experience:
            poet → write → OP

          20th century, field of experience (expert):
            expert reader → read → OP
            expert reader → write → CT

          20th century, field of experience (novice):
            novice reader → read → CT

          OP and CT are compared.

                    Figure 9: Schema of the relationship between OP and CT

           +-------- # of pair
           | +----- value of matching level, exact=17, field=13, group=10
           | | +-- # of POS
           | | |
           | | | # of element of OP ----+        +- # of element of CT
           | | |         element of OP -+ |      | +--- element of CT
           | | |                        | |      | |
           1 17 11                       00 <-> 12        (Tatsutahime)
           2 17 47                       04 <-> 25         (hand)
           3 17 47                       05 <-> 26        (toward)
           4 17 2                        06 <-> 32         (god)
           5 10 61                       07 <-> 33         (SUB)
           6 17 47                       08 <-> 34        (be)
           7 10 64                       09 <-> 35         (because)
           8 17 65                       11 <-> 36        (EM)
           9 17 2                        12 <-> 38         (autumn)
          10 17 71                       13 <-> 39         (CON)
          11 17 2                        14 <-> 40        (leaf of tree)
          12 17 2                        19 <-> 45         (present)
          13 17 61                       20 <-> 46         (CRD)
          14 17 47                       21 <-> 49        (fall)
          15 13 74                       22 <-> 54         (CJR)

                            Figure 10: Example of the matching process
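
The three matching levels in Figure 10 can be sketched as a prefix comparison of two codes. The values 17, 13, and 10 coincide with the character lengths of "BG-01-2030-01-030" (up to the exact ID), "BG-01-2030-01" (up to the field ID), and "BG-01-2030" (up to the group ID); treating the level as the longest shared prefix of that kind is an assumption, and `match_level` is a hypothetical name.

```python
# Prefix lengths, longest first: exact (17), field (13), group (10).
LEVELS = (17, 13, 10)


def match_level(code_a: str, code_b: str) -> int:
    """Return 17, 13, or 10 for the longest shared code prefix, or 0 if none is shared."""
    for n in LEVELS:
        if len(code_a) >= n and len(code_b) >= n and code_a[:n] == code_b[:n]:
            return n
    return 0


# kami (god) against hime (princess): same group and field, different exact ID.
print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-101-A"))
```

The era flag lies beyond the 17th character, so under this reading a contemporary and a classic item with the same exact ID still count as an exact match.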




        Residual

   CT   (                                )         (                )
   OP   — —— — — — — — — —                         — — — — —— —


   CT   (        )                           ( ) (       )    (           )
   OP   — —                                  [ ]       — —    — — — —



            Figure 11: Example of the matching process in the case of kks 298 in
                       Komachiya (1982)




       Components of OP
          Table 2: Result of subtracting the elements of OP(298) from those
                   of CT(298, koma): the ratios of the components of OP(298).
          OP    (number of valid elements)                     =   16
          E     (ratio of exact match)             12/16       =   0.750
          F     (ratio of field match)              1/16       =   0.062
          G     (ratio of group match)              2/16       =   0.125
          T     (ratio of total match)             15/16       =   0.938
          U     (ratio of unmatched OP)            1 - T       =   0.062
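
The ratios in Table 2 follow from the matching levels of the 15 pairs listed in Figure 10 and the 16 valid elements of OP(298). A minimal sketch of that arithmetic; `match_ratios` and `LEVELS_298` are hypothetical names.

```python
from collections import Counter

# Matching level of each of the 15 pairs in Figure 10 (17 = exact, 13 = field, 10 = group).
LEVELS_298 = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]


def match_ratios(levels, op_size):
    """Return the ratios E, F, G, T, U of Table 2 for one poem."""
    counts = Counter(levels)
    e = counts[17] / op_size
    f = counts[13] / op_size
    g = counts[10] / op_size
    t = len(levels) / op_size
    return {"E": e, "F": f, "G": g, "T": t, "U": 1 - t}


for name, value in match_ratios(LEVELS_298, op_size=16).items():
    print(f"{name} = {value:.3f}")
```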




       Calculation of Residual Rate



                            D = 1 - \frac{P}{T}        (1)
                              = 1 - \frac{16}{41}      (2)
                              = 0.61                   (3)




       Components of CT
          Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
                   fabs(D-H) stands for the absolute value of the practical
                   value, D, minus the theoretical value, H.

           CT (number of valid elements)                            = 41
           W  (ratio of original word use)    12/41                 = 0.293   (E/CT)
           A  (ratio of annotation)           1-0.293               = 0.707   (1-W)
               --- breakdown of the annotation ---
               P1 (ratio of FG paraphrased)   (0.62+0.12)/0.707     = 0.073   (F+G)/A
               P2 (ratio of U paraphrased)    (0.707-0.073)*0.062   = 0.040   (A-P1)*U
               D  (ratio of purely added)     0.707-(0.073+0.040)   = 0.595   A-(P1+P2)
           H  (theoretical value of D)        1-16/41               = 0.610   1-OP/CT
           Gap                                fabs(0.595-0.610)     = 0.015   fabs(D-H)
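
The values in Table 3 can be reproduced from the element counts. This is a sketch under one stated assumption: P1 is computed as the number of field and group matches divided by the size of CT, 3/41, which gives the tabulated 0.073, whereas the expression printed in the P1 row does not evaluate to that value. `ct_components` is a hypothetical name.

```python
def ct_components(op_size, ct_size, exact, field, group):
    """Return the Table 3 quantities from element counts of one OP/CT pair."""
    w = exact / ct_size                  # original word use, E/CT
    a = 1 - w                            # annotation, 1 - W
    p1 = (field + group) / ct_size       # assumption: field and group matches over CT
    u = 1 - (exact + field + group) / op_size   # unmatched ratio of OP (Table 2)
    p2 = (a - p1) * u                    # (A - P1) * U
    d = a - (p1 + p2)                    # purely added, A - (P1 + P2)
    h = 1 - op_size / ct_size            # theoretical value, 1 - OP/CT
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h, "Gap": abs(d - h)}


result = ct_components(op_size=16, ct_size=41, exact=12, field=1, group=2)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```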



       Subtraction: CT - OP


          OP(298): 16 elements
            Exact      12 (75.0%)
            Group       2 (12.5%)
            Field       1 (6.2%)
            Unmatched   1 (6.2%)

          CT(298, koma): 41 elements
            W          12 (29.3%)
            P1          3 (7.3%)
            P2          1 (4.0%)
            D          25 (59.5%)

          Figure 12: Pie-charts illustrating the components of OP(298) and
                     CT(298, koma)




       (E) Mathematical modelling
                cw(t_1, t_2) = (1 + \log ctf(t_1, t_2)) \sqrt{idf(t_1) \, idf(t_2)}   (4)

                idf(t) = \log \frac{N}{df(t)}                                         (5)
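
Equations (4) and (5) can be evaluated on a toy corpus. The sketch below assumes that N is the number of poems, df(t) the number of poems containing t, and ctf(t1, t2) the number of poems in which both terms occur, and that the logarithm is natural; the slide states none of these. `cooccurrence_weights` and `POEMS` are hypothetical, and the four token lists are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each poem is a list of tokens (invented data, for illustration only).
POEMS = [
    ["cuckoo", "cry", "mountain"],
    ["cuckoo", "cry"],
    ["plum", "warbler"],
    ["plum", "branch"],
]


def cooccurrence_weights(poems):
    """Return {(t1, t2): cw(t1, t2)} for every pair of terms sharing a poem."""
    n = len(poems)
    df = Counter()    # poems containing each term
    ctf = Counter()   # poems containing both terms of a pair
    for tokens in poems:
        types = sorted(set(tokens))
        df.update(types)
        ctf.update(combinations(types, 2))
    idf = {t: math.log(n / df[t]) for t in df}               # equation (5)
    return {
        (t1, t2): (1 + math.log(c)) * math.sqrt(idf[t1] * idf[t2])   # equation (4)
        for (t1, t2), c in ctf.items()
    }


for pair, weight in sorted(cooccurrence_weights(POEMS).items()):
    print(pair, round(weight, 3))
```

Pairs are keyed in alphabetical order, so each unordered pair appears once.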
       [Figure: network graph; nodes are words and the links between them carry
        numeric labels. Node labels include: cuckoo, mountain cuckoo, warbler,
        sing.vi, cry.vi, voice, singing voice, a cry, hear, be heard.1, treetop,
        treetop high.1, treetop high.3, summer, summer mountains, midsummer rain,
        May, last year, this morning, every morning, spring, plum, branch,
        break off, green willow, woven hat, sew.2, wear in (my) hair, field,
        scatter.1, flutter.2, imperceptibly, hide.vi.2, stop.vi.1, go over,
        borrow, force, regret, separation, far, near, side, old age,
        Otowa.PN, Tatsuta.PN]
                                                                                                                                                                                                flower
                                                                                                                                                                                                 9

                                                                                                                                      10
                                                                                                                                           9
                                                                                                                                                                                   yet.1
                                                        iris.1              reason.1
                                                                   6


                                                                                                                                                                       touch                                    lure
                                                                                                                 stand.vi
                                                                                                                                                                                                                                         4
                                                                                                                                                                                                                                                       send
                                                                                                                             spring haze                                                                                    7

                                                                                                                                                                                                                        5
                                                                                                                                                                                                           4
                                                                                                                                                                         10
                                                                                                                                                                                                                                         fragrance.1


                                                                                                                                                                                                                       attach
                                                                                                                                                                  hand                    guidance.1

                                                                                                      warbler-CT-23-229-3.73-15 cuckoo-CT-40-370-3.27-16



       Conclusion
       The thesaurus annotated with meta-codes allows researchers

         1. to identify different orthographies as the same word;

         2. to attach an alternative semantic ID to a word that has the
            same form but more than one meaning (polysemous word);

         3. to attach meta-codes not only to tokens recognised as a
            single (simple) word but also to longer tokens (e.g. compound words);

         4. to indicate the similarity between tokens;

         5. to detect tokens that are shared or that differ across two or more texts,
            which reveals the similarities and differences between the texts;

         6. to indicate the relative differences between two words in literary
            works.
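
       As a rough illustration of items 1 and 4, the sketch below compares two
       meta-codes of the form shown earlier (e.g. BG-01-2030-01-030-A for kami)
       and returns the matching level used in the matching example (exact = 17,
       field = 13, group = 10). The names MetaCode, parse_meta_code and
       match_level are hypothetical, and ignoring the era flag during comparison
       is an assumption made here, not something the slides state. This is a
       minimal sketch, not the authors' t2c implementation.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    """The first six hyphen-separated parts of a thesaurus meta-code."""
    database: str  # e.g. "BG" = short-unit general vocabulary
    pos: str       # part of speech ID, e.g. "01" = noun
    group: str     # group ID, e.g. "2030" = Shinto deities and Buddhas
    field: str     # field ID
    exact: str     # exact ID, e.g. "030" = god
    era: str       # era flag: "A" = contemporary, "C" = classic


# Matching-level values as listed in the matching example.
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0


def parse_meta_code(code: str) -> MetaCode:
    """Split a code such as 'BG-01-2030-01-030-A' into its parts."""
    parts = code.split("-")
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 parts, got {len(parts)}: {code!r}")
    # Any trailing parts (reading, Chinese character) are ignored here.
    return MetaCode(*parts[:6])


def match_level(a: str, b: str) -> int:
    """Return the deepest level at which two meta-codes agree.

    Assumption: the era flag is not compared, so a classic and a
    contemporary entry with the same exact ID count as an exact match.
    """
    x, y = parse_meta_code(a), parse_meta_code(b)
    if (x.database, x.pos, x.group) != (y.database, y.pos, y.group):
        return NO_MATCH
    if x.field != y.field:
        return GROUP
    if x.exact != y.exact:
        return FIELD
    return EXACT


if __name__ == "__main__":
    kami = "BG-01-2030-01-030-A"  # god
    hime = "BG-01-2030-01-101-A"  # princess
    te = "BG-01-5730-02-010-A"    # hand
    print(match_level(kami, hime))  # 13 (same group and field, different exact ID)
    print(match_level(kami, te))    # 0  (different group)
```

       Two orthographies of the same word would carry the same code and so match
       at level 17, while kami and hime, which share a group and field but not an
       exact ID, match at level 13.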




       Questions
         • Computer Modelling of Classical Japanese Poetic
           Vocabulary
            http://etymology.jp/waka/poem.cgi
         • Inquiry:
            Hilofumi Yamamoto
            yamagen@ryu.titech.ac.jp
         • Thank you.
