1




Hilofumi Yamamoto




   June 4, 2008
2




•
    –
    –
    –
    – 1000


•                    (Goodenough, 1981)
                 (
             )
3




•   —
•
•
•


•
4




•
•
•
•
5




•   (2005)—
•   (2006)—
•
•
•
•
6




•
•
•           (   , 1983;   , 1989)
    → p.2

•
7




[Figure: timeline of the eight imperial anthologies; horizontal axis 900 to 1250]

Kokinshū (905), Gosenshū (951), Shūishū (1007), Goshūishū (1086), Kin'yōshū (1124), Shikashū (1144), Senzaishū (1188), Shinkokinshū (1205)
Intervals between successive anthologies: 46, 56, 79, 38, 20, 44, 17 years
8




1.
2.         (1976)
     •
     •
3.         (1991)


4.       (1998)
9




•


•
•
•
→
10




•
•             9484
    (                                        )
• kh              (β   )
•         (                     )      t2c


•
•       (48732)        (1408)       (49)
11




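雪 / の / 内 / に / 春 / は / 来 / に / けり / 鴬 / の / 凍れ / る / 涙 / 今 / や / 解く / らむ
(yuki / no / uchi / ni / haru / wa / ki / ni / keri / uguisu / no / kōre / ru / namida / ima / ya / toku / ramu)
(Gloss: spring has come while snow still lies; the warbler's frozen tears must now be melting.)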




•     –            –            –        ...
                                                       ..
12




•
•
•
13




•
•
     (    , 1983)
•
     (    , 1996)


    idf (inverse document frequency)
                     (       )
14



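idf (Spärck Jones, 1972)

\[
\mathrm{idf}(t, N) = \log \frac{N}{\mathrm{df}(t)}
\]

\begin{align}
\mathrm{idf}(\mathit{ari}, N) &= \log \frac{N}{\mathrm{df}(\mathit{ari})} \tag{1} \\
&= \log \frac{9484}{1201} \tag{2} \\
&= \log 7.89\ldots \tag{3} \\
&\approx 2.07 \tag{4}
\end{align}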
15



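idf (Spärck Jones, 1972)

\[
\mathrm{idf}(t, N) = \log \frac{N}{\mathrm{df}(t)}
\]

\begin{align}
\mathrm{idf}(\mathit{uguisu}, N) &= \log \frac{N}{\mathrm{df}(\mathit{uguisu})} \tag{5} \\
&= \log \frac{9484}{101} \tag{6} \\
&= \log 93.90\ldots \tag{7} \\
&= 4.54\ldots \tag{8}
\end{align}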
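
The two worked examples can be checked numerically. The sketch below is a minimal illustration, not the author's implementation; the function name `idf` and its argument names are chosen here for illustration. It assumes a natural logarithm, since that is the base that reproduces the values on the slides.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log(N / df(t)), natural log assumed."""
    if doc_freq <= 0 or doc_freq > n_docs:
        raise ValueError("doc_freq must satisfy 0 < doc_freq <= n_docs")
    return math.log(n_docs / doc_freq)


N = 9484  # number of poems, as given on the slides

idf_ari = idf(N, 1201)    # common term: low idf, about 2.07
idf_uguisu = idf(N, 101)  # rarer term: higher idf, about 4.54
```

The rarer word (uguisu, in 101 poems) receives more than twice the weight of the common one (ari, in 1201 poems).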
16



[Figure: L-Shape Freq-Type. Number of types (0 to 3500) plotted against frequency (0 to 2000)]
17



[Figure: J-Shape IDF-Type. Number of types (0 to 1200) plotted against inverse document frequency (idf), 1 to 9]
18




•                                (     )




•


• tf-idf

\[
w(t, K, N) = (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N)
\]
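
As a rough sketch of the weight w(t, K, N), the function below combines a log-scaled term frequency with idf. The name `term_weight`, its parameters, and the zero weight for an absent term are assumptions made for illustration; natural logarithms are assumed throughout.

```python
import math


def term_weight(tf: int, n_docs: int, doc_freq: int) -> float:
    """w(t, K, N) = (1 + log tf(t, K)) * idf(t, N), natural logs assumed."""
    if tf <= 0:
        # Assumption: a term that does not occur in K gets weight zero.
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / doc_freq)


# With tf = 1 the weight reduces to the plain idf value.
w_once = term_weight(1, 9484, 101)
# A hypothetical term frequency of 3 raises the weight by a factor of 1 + log 3.
w_thrice = term_weight(3, 9484, 101)
```

The logarithm damps the effect of repeated occurrences, so a term seen three times weighs about twice as much as a term seen once, not three times.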
19



                                     (cw)


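\begin{align}
w(t, K, N) &= (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N) \tag{9} \\
\mathrm{cidf}(t_1, t_2, N) &= \sqrt{\mathrm{idf}(t_1, N) \, \mathrm{idf}(t_2, N)} \tag{10} \\
\mathrm{ctf}(t_1, t_2, K) &= 1 + \log |\{k : t_1, t_2 \in k\}| \tag{11}
\end{align}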

• K
• (10)
    → cidf

• (11)       K
•
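
Equations (10) and (11) can be sketched as follows. This is an illustrative reading under stated assumptions: each poem is treated as a set of tokens, K is a list of such sets, natural logarithms are used, and a pair that never co-occurs is given ctf = 0 (the slide does not define that case). The toy collection `toy_k` is made up for the example.

```python
import math
from typing import Iterable, Set


def cidf(idf_t1: float, idf_t2: float) -> float:
    """Eq. (10): geometric mean of the idf values of the two terms."""
    return math.sqrt(idf_t1 * idf_t2)


def ctf(t1: str, t2: str, poems: Iterable[Set[str]]) -> float:
    """Eq. (11): 1 + log of the number of poems in K containing both terms."""
    count = sum(1 for poem in poems if t1 in poem and t2 in poem)
    if count == 0:
        return 0.0  # assumption: undefined on the slide, treated as zero here
    return 1.0 + math.log(count)


# Hypothetical toy collection K (romanized tokens, invented for illustration).
toy_k = [
    {"yuki", "haru", "uguisu"},
    {"haru", "uguisu", "hana"},
    {"haru", "kasumi"},
]

pair_ctf = ctf("haru", "uguisu", toy_k)  # the pair co-occurs in two poems
pair_cidf = cidf(2.0, 8.0)               # geometric mean of 2 and 8
```

Taking the geometric mean means a pair scores high on cidf only when both terms are reasonably rare; one very common term pulls the value down.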
20



[Figure: distribution of cidf. Frequency of patterns (0 to 1000) plotted against cidf (0 to 9)]
21



                                                     (cw)


\begin{align}
\mathrm{ictf}(t_1, t_2, N) &= 1 + \log \frac{|N|}{|\{n : t_1, t_2 \in n\}|} \tag{12} \\
\mathrm{cw}(t_1, t_2) &= \mathrm{ctf}(t_1, t_2, K) \, \mathrm{ictf}(t_1, t_2, N) \, \mathrm{cidf}(t_1, t_2, N) \tag{13}
\end{align}

         • K                                     N

         •

         •                       K

         •                       N
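
Equations (12) and (13) can be combined into one small function. The sketch below is an illustration under assumptions, not the author's code: function and parameter names are invented here, natural logarithms are assumed, and co-occurrence counts are passed in directly. The example uses rounded idf values 3.43 and 9.16 with |N| = 9484; an idf of 9.16 corresponds to a term found in a single poem, so the pair can co-occur in only one poem.

```python
import math


def ictf(n_total: int, pair_count_n: int) -> float:
    """Eq. (12): 1 + log(|N| / number of poems in N containing both terms)."""
    return 1.0 + math.log(n_total / pair_count_n)


def cw(pair_count_k: int, pair_count_n: int, n_total: int,
       idf_t1: float, idf_t2: float) -> float:
    """Eq. (13): cw = ctf * ictf * cidf, with natural logs assumed."""
    ctf_value = 1.0 + math.log(pair_count_k)   # eq. (11)
    cidf_value = math.sqrt(idf_t1 * idf_t2)    # eq. (10)
    return ctf_value * ictf(n_total, pair_count_n) * cidf_value


# Pair seen once in K and once in N, with idf values 3.43 and 9.16.
example_cw = cw(1, 1, 9484, 3.43, 9.16)  # roughly 56.9
```

This comes out close to the cw of 56.96 listed in the result table for a pair with co-occurrence count 1 and idf values 3.43 and 9.16, which suggests the ctf column of the tables holds the raw count rather than 1 + log of it.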
22



[Figure: cw. Cumulative frequency of patterns (0 to 900) plotted against co-occurrence weight (cw), 0 to 100; eight numbered series, 1 to 8]
23



1σ




         16
     (        )
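
One way to read the 1σ cutoff is as a z-score filter: a pair is kept when its cw lies more than one standard deviation above the mean, which for a normal distribution is roughly the top 16 percent. The sketch below is only an illustration under that reading. The function name, the use of the population standard deviation, and the toy weights are all assumptions, not taken from the slides.

```python
import statistics
from typing import Dict


def select_above_sigma(cw_by_pair: Dict[str, float],
                       threshold: float = 1.0) -> Dict[str, float]:
    """Return the z score of each pair whose cw exceeds mean + threshold * sd."""
    values = list(cw_by_pair.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return {}  # all weights equal, nothing stands out
    z_by_pair = {pair: (v - mean) / sd for pair, v in cw_by_pair.items()}
    return {pair: z for pair, z in z_by_pair.items() if z > threshold}


# Made-up weights for four hypothetical pairs.
toy_weights = {"a-b": 10.0, "a-c": 20.0, "a-d": 30.0, "a-e": 90.0}
selected = select_above_sigma(toy_weights)  # only "a-e" passes the cutoff
```

A fixed z threshold makes the cutoff relative to each distribution, so the same rule can be applied to groups whose raw cw ranges differ.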
24
25
26
27
28
                                       (1)

        t1–t2      cw       z   ctf   idf(t1)     idf(t2)
(24)       –     86.06   3.33    10      3.18        4.63
           –     65.15   1.76     5      3.18        3.26
           –     64.32   1.70     2      3.43        4.69
           –     63.36   1.62     2      3.18        4.92
           –     61.87   1.51     2      3.18        4.69
           –     60.36   1.40     4      3.18        3.18
           –     55.34   1.02     2      3.18        4.37
(11)       –     54.69   1.33     3      3.18        4.63
           –     52.40   1.12     3      3.18        3.26
           –     51.40   1.03     1      3.18        8.06
           –     51.28   1.02     2      3.43        4.63
(15)       –     80.25   3.74     8      3.18        4.63
           –     55.90   1.54     2      3.18        3.83
           –     54.92   1.46     8      3.18        2.08
           –     54.35   1.40     2      3.18        3.95
           –     52.42   1.23     2      3.18        3.37
           –     50.48   1.05     1      3.18        7.77
  (3)   N/A
29
                                        (2)

         t1–t2      cw       z   ctf   idf(t1)     idf(t2)
(5)         –     72.27   3.34     4      3.43        4.63
            –     52.17   1.44     2      3.43        3.95
            –     51.68   1.40     2      3.43        3.71
            –     51.00   1.33     2      3.43        3.43
            –     49.48   1.19     4      3.43        2.08
            –     48.33   1.08     1      3.43        6.59
            –     47.56   1.01     1      3.43        6.38
(6)      N/A
(9)      N/A
  (24)      –     63.56   1.64    3       3.43        4.63
            –     62.38   1.55    3       3.43        3.14
            –     62.18   1.53    4       3.18        4.63
            –     56.96   1.14    1       3.43        9.16
30




•


•   (cw)   z       1σ

      1σ(16    )
•


•
31




•
•
•


•
    http://etymology.jp/waka/poem.cgi
    XML(SVG)
•

Keio slide
