Suffix Array
Solr Study Group (Solr勉強会), 2011/12/19




       1
•              (@nobu_k)

• Preferred Infrastructure (PFI   FI)

  •
  •
• Sedue(2 )
                      2
Suffix Array
•   Suffix Array(SA):

•   (      )                       1

    •    Sedue

•   SA

•
    •    +Sedue

•                      ”   -           ”


                               3
•
•
    •              (       )

    • n-gram(q-gram)
•
                       4
Suffix Array
•
•
• n-gram
  •
•
                5
Suffix(                )
mississippi
                   0:   mississippi
                   1:   ississippi
                   2:   ssissippi
                   3:   sissippi
                   4:   issippi
                   5:   ssippi
                   6:   sippi
                   7:   ippi
                   8:   ppi
                   9:   pi
                  10:   i
              6
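The suffix list on this slide can be reproduced in a few lines. This is a minimal Python sketch for illustration only; the function name `suffixes` is an assumption and does not come from the slides or from Sedue.

```python
def suffixes(text):
    """Return (start_position, suffix) pairs for every suffix of text."""
    return [(i, text[i:]) for i in range(len(text))]


# Print the suffixes in the same "position: suffix" layout as the slide.
for pos, suf in suffixes("mississippi"):
    print(f"{pos:2d}: {suf}")
```

Positions are 0-based here, matching the numbering used on the slide.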
Suffix Array
 0:   mississippi       10:   i
 1:   ississippi         7:   ippi
 2:   ssissippi          4:   issippi
 3:   sissippi           1:   ississippi
 4:   issippi            0:   mississippi
 5:   ssippi             9:   pi
 6:   sippi              8:   ppi
 7:   ippi               6:   sippi
 8:   ppi                3:   sissippi
 9:   pi                 5:   ssippi
10:   i                  2:   ssissippi

                    7
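The step from the left column (suffixes in text order) to the right column (suffixes in lexicographic order) can be sketched as a plain sort. This is a naive illustration, not how a production engine builds its index: it compares whole suffixes, which is far too slow for large texts, and the name `build_suffix_array` is an assumption made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive approach: sort the positions, comparing the suffixes they point to.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("mississippi")
print(sa)  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

Only the start positions are kept; the suffix strings themselves are never stored.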
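10:   i
 7:   ippi
 4:   issippi
 1:   ississippi
 0:   mississippi
 9:   pi
 8:   ppi
 6:   sippi
 3:   sissippi
 5:   ssippi
 2:   ssissippi

•   Search mississippi for 'si'
•   Suffixes beginning with 'si' are adjacent in the sorted order: 6: sippi, 3: sissippi
    •   Matches at positions 3 and 6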


                        8
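The lookup of 'si' on this slide can be sketched as two binary searches over the sorted positions: one for the first suffix whose prefix is not smaller than the pattern, one for the first suffix whose prefix is larger. The sketch below is an assumption about how such a search is commonly written, not the Sedue implementation; `find_occurrences` is a hypothetical name, and strings are compared by code point.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted start positions where pattern occurs in text.

    sa must be the suffix array of text (positions in suffix order).
    """
    m = len(pattern)

    # Lower bound: first rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first rank whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
print(find_occurrences(text, sa, "si"))  # [3, 6]
```

Each comparison looks at no more than `len(pattern)` characters, so a query costs roughly O(m log n) character comparisons for a pattern of length m and a text of length n.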
10:   i
 7:   ippi
 4:   issippi
 1:   ississippi
 0:   mississippi
 9:   pi
 8:   ppi
 6:   sippi
 3:   sissippi
 5:   ssippi
 2:   ssissippi

SA[i]:   10 7 4 1 0 9 8 6 3 5 2
 T[i]:   m i s s i s s i p p i

Entry 6:   T[SA[6]]
         → T[8]
         → “ppi”
                       9
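The point of this slide, that only the integer array SA and the text T are stored and each suffix is recovered through T[SA[i]], can be checked directly. A minimal Python sketch using the values from the slide; `suffix_at` is a hypothetical helper name.

```python
T = "mississippi"
SA = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  # values from the slide


def suffix_at(rank):
    """Return the suffix of T with the given rank in sorted order."""
    return T[SA[rank]:]


print(SA[6])         # 8
print(suffix_at(6))  # ppi
```

Ranks are 0-based, so entry 6 is the seventh suffix in sorted order.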
(1/3)

T[i]:
   1    2         3   ...        n

                            SA

            SA[i]


             10
(2/3)
RedBull            !!

1. RedBull                             *2
    RedBull        SA[i]
          2.
    RedBull


     1         2        3   ...    n
                            11
(3/3)
3.
     RedBull


      1          2      3        ...          n

4.

(     1, 3), (       2, 4), (          3, 2),...,(   n, 2)

                            12
•       SA

    •                  +

    •        /n-gram

• SA
•
               13
SA
•                        (n-gram        )

    •
•               n-gram

•
    •
        •   “THIS IS IT”
    •   proximity

                                   14
SA
•
    •
    •
•
    •   HDD

    •         (        )

•
    •
                  15
•
•                                   (   )

    •   SAIS

    •
•   HDD

    •                                   (dc3, dc7)

•   Sedue      Haskell        C++

    •   @tanakh++


                         16
•                         (             )

    •
    •
        •    1       100GB/day

•   Sedue

    •   SA                n-gram

    •            n-gram

    •   SA           n-gram

    •

                                   17
HDD
•               HDD

•                     OK

    •
    •
•   SSD

    •   SSD

•   Sedue              20       (80MB)

    •   SA[i]


                           18
VS
1.   SA

     •
2.

     •    SSD+              500

3.

     •    O(N)       CPU

4.

     •
•           malloc



                           19
•   Sedue   1                  56

    •           : 40

    •             : 16          (UTF-16)

    •                           2   3

•
    •                           =

    •
        •                SSD

•

                                        20
SA
•
    •            4(+1)

        •   2-gram

    •
        •                %        OK

    •
•   ”        ”

    •
                             21
•
    •
    •
•
    •
        22
: groonga

• Sedue   groonga

  •
•
• Sedue       groonga!!


                    23
:
•

•
•              (http://jubat.us/)

    •   http://github.com/jubatus
    •   @JubatusOfficial
•                    with NTT PF

                               24
: Fluentd
•            Ruby

• Treasure Data, Inc.
  • @frsyuki, @kzk_mover
• Solr
• gem install fluentd
• Visit http://fluentd.org/doc/ now!!
                      25
•


    26

Suffix Array @ Solr Study Group (Solr勉強会)
