SlideShare a Scribd company logo
1 of 20
Download to read offline
Quick Search algorithm
       and strstr


             Cybozu Labs
2012/3/31 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 3(#x86opti)
Agenda
 Quick Search
 vs. strstr of gcc on Core2Duo
 vs. strstr of gcc on Xeon
 fast strstr using strchr of gcc
 vs. my implementation on Xeon
   restriction
 vs. strstr of VC2011 beta on Core i7
 feature of pcmpestri
 range version of strstr



2012/3/31 #x86opti                       2 /20
Quick Search algorithm(1/2)
 Simplified and improved Boyer-Moore algorithm
   initialized table for "this is"
    char      't'    'h'   'I'   's'      ''   other
    skip     +7      +6    +2    +1       +3    +8
                                                           0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
                                                       0   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       1   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
   How to initialize table for given                  2   3   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
    string [str, str + len)                            3
                                                       4
                                                           8
                                                           8
                                                               8
                                                               8
                                                                   8
                                                                   8
                                                                       8
                                                                       8
                                                                           8
                                                                           8
                                                                               8
                                                                               8
                                                                                   8
                                                                                   8
                                                                                       8
                                                                                       8
                                                                                           8
                                                                                           8
                                                                                               8
                                                                                               8
                                                                                                   8
                                                                                                   8
                                                                                                       8
                                                                                                       8
                                                                                                           8
                                                                                                           8
                                                                                                               8
                                                                                                               8
                                                                                                                   8
                                                                                                                   8
                                                                                                                       8
                                                                                                                       8
                                                       5   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
  int tbl_[256];                                       6   8   8   8   8   8   8   8   8   6   2   8   8   8   8   8   8
                                                       7   8   8   8   1   7   8   8   8   8   8   8   8   8   8   8   8
  void init(const char *str,           int len) {      8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       9   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      std::fill(tbl_, tbl_ +           256, len);      A   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      for (size_t i = 0; i <           len; i++) {     B   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       C   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
          tbl_[str[i]] = len           - i;            D   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      }                                                E   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       F   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
  }
2012/3/31 #x86opti                                                                                                         3 /20
Quick Search algorithm(2/2)
 Searching phase
   simple and fast
   see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

       const char *find(const char *begin, const char *end) {
           while (begin <= end - len_) {
               if (memcmp(str_, begin, len_) == 0) return begin;
               begin += tbl_[begin[len_]];
           }
           return end;
       };




2012/3/31 #x86opti                                                 4 /20
Benchmark
 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
   33MB UTF-8 text
                                  10
             cycle/Byte to find



                                   8               fast
                                   6
                                   4
                                   2                      strstr
                                   0                      org Qs




                                       substring
   Qs(Quick search) is faster for long substring
   Remark: assume text does not have ‘¥0’for strstr
2012/3/31 #x86opti                                                 5 /20
A little modification of Qs
 avoid memcmp
              const char *find(const char *begin, const char *end) {
                  while (begin <= end - len_) {
                      if (memcmp(str_, begin, len_) == 0) return begin;
                      begin += tbl_[begin[len_]];
                  }
                  return end; }



     const char *find(const char *begin, const char *end){
         while (begin <= end - len_) {
             for (size_t i = 0; i < len_; i++) {
                 if (str_[i] != begin[i]) goto NEXT;
             }
             return begin;
         NEXT:
             begin += tbl_[static_cast<unsigned char>(begin[len_])];
         }
         return end; }
2012/3/31 #x86opti                                                        6 /20
Benchmark again
 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
   33MB UTF-8 text
                                  10
             cycle/Byte to find



                                   8               fast
                                   6
                                   4
                                                          strstr
                                   2
                                                          org Qs
                                   0
                                                          Qs'



                                       substring

   modified Qs(Qs’) is more faster
   Should we use modified Qs'?
2012/3/31 #x86opti                                                 7 /20
strstr on gcc 4.6 with SSE4.2
 Xeon X5650 2.67Gz on Linux
 strstr with SSE4.2 is faster than Qs’
  for substring with length less than 9 byte
                                  5
             cycle/Byte to find




                                  4               fast
                                  3
                                  2
                                  1                      strstr
                                  0                      Qs'




                                      substring

 Is strstr of gcc is fastest implementation?
2012/3/31 #x86opti                                                8 /20
strstr implementation by strchr
 Find candidate of location by strchr at first,
  and verify the correctness
   strchr of gcc with SSE4.2 is fast

     const char *mystrstr_C(const char *str, const char *key) {
         size_t len = strlen(key);
         while (*str) {
             const char *p = strchr(str, key[0]);
             if (p == 0) return 0;
             if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
             str = p + 1;
         }
         return 0;
     }



2012/3/31 #x86opti                                                 9 /20
strstr vs. mystrstr_C
 Xeon X5650 2.6GHz + gcc 4.6.1
 mystrstr_C is 1.5 ~ 3 times faster than strstr
   except for “ko-re-wa”(in UTF-8)
   maybe penalty for many bad candidates
                              10
         cycle/Byte to find




                               8        fast
                               6
                               4
                                               strstr
                               2
                                               Qs'
                               0
                                               my_strstr_C


                                   substring

2012/3/31 #x86opti                                           10 /20
real speed of SSE4.2(pcmpistri)
 my_strstr is always faster than Qs’
   2 ~ 4 times faster than strstr of gcc

                              10

                               8               fast
         cycle/Byte to find




                               6

                               4                      strstr
                                                      Qs'
                               2
                                                      my_strstr_C
                               0                      my_strstr



                                   substring

2012/3/31 #x86opti                                                  11 /20
Implementation of my_strstr(1/2)
 https://github.com/herumi/opti/blob/master/str_util.hpp
   written in Xbyak(for my convenience)
 Main loop
     // a : rax(or eax), c : rcx(or ecx)
     //       input a : ptr to text
     //             key : ptr to key
     //       use save_a, save_key, c

         movdqu(xm0, ptr [key]); // xm0 = *key
     L(".lp");
         pcmpistri(xmm0, ptr [a], 12);
               // 12(1100b) = [equal ordered:unsigned:byte]
         jbe(".headCmp");
         add(a, 16);
         jmp(".lp");
     L(".headCmp");
         jnc(".notFound");
2012/3/31 #x86opti                                            12 /20
Implementation of my_strstr(2/2)
 Compare tail in“headCmp”
         ...
         add(a, c); // get position
         mov(save_a, a); // save a
         mov(save_key, key); // save key
     L(".tailCmp");
         movdqu(xm1, ptr [save_key]);
         pcmpistri(xmm1, ptr [save_a], 12);
         jno(".next");
         js(".found");
         // rare case
         add(save_a, 16);
         add(save_key, 16);
         jmp(".tailCmp");
     L(".next");
         add(a, 1);
         jmp(".lp");
2012/3/31 #x86opti                            13 /20
Pros and Cons of my_strstr
 Pros
   very fast
   Is this implementation with Qs fastest?
     No, overhead is almost larger(variable address offset)
 Cons
   access max 16 bytes beyond of the end of text
     almost no problem except for page boundary
     allocate memory with margin
             4KiB readable page                                         not readable page
          FF7   FF8   FF9   FFA   FFB   FFC   FFD   FFE   FFF   000   001   002   003




                                                                 access
                       pcmpistri                                violation
                                        end of text
2012/3/31 #x86opti                                                                      14 /20
strstr of Visual Studio 11
 almost same speed as my_strstr
 of Couse safe to use
   i7-2620 3.4GHz + Windows 7 + VS 11beta
                               8
          cycle/Byte to find




                               6        fast
                               4

                               2               strstr
                                               Qs'
                               0
                                               my_strstr


                                   substring



2012/3/31 #x86opti                                         15 /20
All benchmarks on i7-2600
 find "ko-re-wa" in 33MiB text
   the results strongly depends on text and key


                                                                strstr(before SSE4.2)
                                                      fast
                                                                Qs(gcc)

                                                                Qs'(gcc)

                                                                strstr(gcc;SSE4.2)

                                                                strstr(VC;SSE4.2)

                                                                my_strstr(SSE4.2)


       0   1   2     3   4   5   6   7   8   9 10 11 12 13 14
                     rate for the timing of Qs(gcc)

2012/3/31 #x86opti                                                                      16 /20
range version of strstr
 strstr is not available for string including‘¥0’
 use std::string.find()
   but it is not optimized for SSE4.2
     naive but fast implementation by C
     const char *findStr_C(const char *begin, const char *end,
       const char *key, size_t keySize) {
       while (begin + keySize <= end) {
         const char *p = memchr(begin, key[0], end - begin);
         if (p == 0) break;
         if (memcmp(p + 1, key + 1, keySize - 1) == 0)return p;
         begin = p + 1;
       }
       return end; }

 str_util.hpp provides findStr with SSE4.2
   4 ~ 5 times faster than findStr_C on i7-2600 + VC11
2012/3/31 #x86opti                                                17 /20
feature of pcmpestri
 very complex mnemonics
xmm0 : head of key pcmpestri xmm0, ptr [p], 12
 rax : keySize
 p : pointer to text
                                  rcx : pos of key if found
 rdx : text size
                                  CF : if found
                                  ZF : end of text
                                  SF : end of key
                                  OF : all match
    L(".lp");
        pcmpestri(xmm0, ptr [p], 12);
        lea(p, ptr [p + 16]);
        lea(d, ptr [d - 16]);           do not change carry
        ja(".lp");
        jnc(".notFound");
        // compare leading str...
2012/3/31 #x86opti                                            18 /20
Difference between Xeon and i7
 main loop of my_strstr
    L(".lp");
        pcmpistri(xmm0, ptr [a], 12);
        if (isSandyBridge) {
            lea(a, ptr [a + 16]);
            ja(".lp");                      a little faster on i7
        } else {
            jbe(".headCmp");
            add(a, 16);                 1.1 times faster on Xeon
            jmp(".lp");
    L(".headCmp");
        }
        jnc(".notFound");
        // get position
        if (isSandyBridge) {
            lea(a, ptr [a + c - 16]);
        } else {
            add(a, c);
        }
2012/3/31 #x86opti                                                  19 /20
other features of str_util.hpp
 strchr_any(text, key)[or findChar_any]
   returns a pointer to the first occurrence of any
    character of key int the text
  // search character position of '?', '#', '$', '!', '/', ':'
  strchr_any(text,"?#$!/:");
   same speed as strchr by using SSE4.2
   max length of key is 16
 strchr_range(txt, key)[or findChar_range]
   returns a pointer to the first occurrence of a
    character in range [key[0], key[1]], [key[2], key[3]], ...
   also same speed as strchr and max len(key) = 16
  // search character position of [0-9], [a-f], [A-F]
  strchr_range(text,"09afAF");
2012/3/31 #x86opti                                               20 /20

More Related Content

What's hot

自作ペアリング/BLS署名ライブラリの紹介
自作ペアリング/BLS署名ライブラリの紹介自作ペアリング/BLS署名ライブラリの紹介
自作ペアリング/BLS署名ライブラリの紹介MITSUNARI Shigeo
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装MITSUNARI Shigeo
 
図と実装で理解する『木構造入門』
図と実装で理解する『木構造入門』図と実装で理解する『木構造入門』
図と実装で理解する『木構造入門』Proktmr
 
ブロックチェーン系プロジェクトで着目される暗号技術
ブロックチェーン系プロジェクトで着目される暗号技術ブロックチェーン系プロジェクトで着目される暗号技術
ブロックチェーン系プロジェクトで着目される暗号技術MITSUNARI Shigeo
 
プログラミングコンテストでの乱択アルゴリズム
プログラミングコンテストでの乱択アルゴリズムプログラミングコンテストでの乱択アルゴリズム
プログラミングコンテストでの乱択アルゴリズムTakuya Akiba
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxpsMITSUNARI Shigeo
 
プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~Takuya Akiba
 
暗号技術入門 秘密の国のアリス 総集編
暗号技術入門 秘密の国のアリス 総集編暗号技術入門 秘密の国のアリス 総集編
暗号技術入門 秘密の国のアリス 総集編京大 マイコンクラブ
 
AtCoder Regular Contest 029 解説
AtCoder Regular Contest 029 解説AtCoder Regular Contest 029 解説
AtCoder Regular Contest 029 解説AtCoder Inc.
 
様々な全域木問題
様々な全域木問題様々な全域木問題
様々な全域木問題tmaehara
 

What's hot (20)

自作ペアリング/BLS署名ライブラリの紹介
自作ペアリング/BLS署名ライブラリの紹介自作ペアリング/BLS署名ライブラリの紹介
自作ペアリング/BLS署名ライブラリの紹介
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装
 
図と実装で理解する『木構造入門』
図と実装で理解する『木構造入門』図と実装で理解する『木構造入門』
図と実装で理解する『木構造入門』
 
ブロックチェーン系プロジェクトで着目される暗号技術
ブロックチェーン系プロジェクトで着目される暗号技術ブロックチェーン系プロジェクトで着目される暗号技術
ブロックチェーン系プロジェクトで着目される暗号技術
 
Arc 010 d
Arc 010 dArc 010 d
Arc 010 d
 
プログラミングコンテストでの乱択アルゴリズム
プログラミングコンテストでの乱択アルゴリズムプログラミングコンテストでの乱択アルゴリズム
プログラミングコンテストでの乱択アルゴリズム
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxps
 
プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~
 
Convex Hull Trick
Convex Hull TrickConvex Hull Trick
Convex Hull Trick
 
arc047
arc047arc047
arc047
 
暗認本読書会11
暗認本読書会11暗認本読書会11
暗認本読書会11
 
集約署名
集約署名集約署名
集約署名
 
Chokudai search
Chokudai searchChokudai search
Chokudai search
 
暗号技術入門 秘密の国のアリス 総集編
暗号技術入門 秘密の国のアリス 総集編暗号技術入門 秘密の国のアリス 総集編
暗号技術入門 秘密の国のアリス 総集編
 
Binary indexed tree
Binary indexed treeBinary indexed tree
Binary indexed tree
 
RMQ クエリ処理
RMQ クエリ処理RMQ クエリ処理
RMQ クエリ処理
 
グラフネットワーク〜フロー&カット〜
グラフネットワーク〜フロー&カット〜グラフネットワーク〜フロー&カット〜
グラフネットワーク〜フロー&カット〜
 
AtCoder Regular Contest 029 解説
AtCoder Regular Contest 029 解説AtCoder Regular Contest 029 解説
AtCoder Regular Contest 029 解説
 
様々な全域木問題
様々な全域木問題様々な全域木問題
様々な全域木問題
 
暗認本読書会10
暗認本読書会10暗認本読書会10
暗認本読書会10
 

Similar to Quick Search algorithm and strstr

Similar to Quick Search algorithm and strstr (15)

Sandu TAB
Sandu TABSandu TAB
Sandu TAB
 
Lagralita
LagralitaLagralita
Lagralita
 
Lagralita
LagralitaLagralita
Lagralita
 
Lagralita
LagralitaLagralita
Lagralita
 
Existing Utilities
Existing UtilitiesExisting Utilities
Existing Utilities
 
Grow By Learning 2009[1] Sept To Dec
Grow By Learning 2009[1] Sept To DecGrow By Learning 2009[1] Sept To Dec
Grow By Learning 2009[1] Sept To Dec
 
Grow By Learning[2] September December 2008
Grow By Learning[2] September   December 2008Grow By Learning[2] September   December 2008
Grow By Learning[2] September December 2008
 
MIS - Plant Performance
MIS - Plant PerformanceMIS - Plant Performance
MIS - Plant Performance
 
16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis
 
16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis
 
T2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage CrisisT2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage Crisis
 
T2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage CrisisT2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage Crisis
 
TEMS Total Energy Management Service
TEMS Total Energy Management ServiceTEMS Total Energy Management Service
TEMS Total Energy Management Service
 
Thoughts of a friend
Thoughts of a friendThoughts of a friend
Thoughts of a friend
 
HCI in IoT
HCI in IoTHCI in IoT
HCI in IoT
 

More from MITSUNARI Shigeo

暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学MITSUNARI Shigeo
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコルMITSUNARI Shigeo
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advancedMITSUNARI Shigeo
 
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenMITSUNARI Shigeo
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用MITSUNARI Shigeo
 
LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介MITSUNARI Shigeo
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたMITSUNARI Shigeo
 
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用MITSUNARI Shigeo
 

More from MITSUNARI Shigeo (20)

暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advanced
 
暗認本読書会12
暗認本読書会12暗認本読書会12
暗認本読書会12
 
暗認本読書会9
暗認本読書会9暗認本読書会9
暗認本読書会9
 
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
 
暗認本読書会8
暗認本読書会8暗認本読書会8
暗認本読書会8
 
暗認本読書会7
暗認本読書会7暗認本読書会7
暗認本読書会7
 
暗認本読書会6
暗認本読書会6暗認本読書会6
暗認本読書会6
 
暗認本読書会5
暗認本読書会5暗認本読書会5
暗認本読書会5
 
暗認本読書会4
暗認本読書会4暗認本読書会4
暗認本読書会4
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
私とOSSの25年
私とOSSの25年私とOSSの25年
私とOSSの25年
 
楕円曲線と暗号
楕円曲線と暗号楕円曲線と暗号
楕円曲線と暗号
 
HPC Phys-20201203
HPC Phys-20201203HPC Phys-20201203
HPC Phys-20201203
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用
 
LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみた
 
ゆるバグ
ゆるバグゆるバグ
ゆるバグ
 
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用
ElGamal型暗号文に対する任意関数演算・再暗号化の二者間秘密計算プロトコルとその応用
 

Recently uploaded

Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sectoritnewsafrica
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Recently uploaded (20)

Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

Quick Search algorithm and strstr

  • 1. Quick Search algorithm and strstr Cybozu Labs 2012/3/31 MITSUNARI Shigeo(@herumi) x86/x64 optimization seminar 3(#x86opti)
  • 2. Agenda  Quick Search  vs. strstr of gcc on Core2Duo  vs. strstr of gcc on Xeon  fast strstr using strchr of gcc  vs. my implementation on Xeon  restriction  vs. strstr of VC2011 beta on Core i7  feature of pcmpestri  range version of strstr 2012/3/31 #x86opti 2 /20
  • 3. Quick Search algorithm(1/2)  Simplified and improved Boyer-Moore algorithm  initialized table for "this is" char 't' 'h' 'I' 's' '' other skip +7 +6 +2 +1 +3 +8 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 1 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8  How to initialize table for given 2 3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 string [str, str + len) 3 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 5 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 int tbl_[256]; 6 8 8 8 8 8 8 8 8 6 2 8 8 8 8 8 8 7 8 8 8 1 7 8 8 8 8 8 8 8 8 8 8 8 void init(const char *str, int len) { 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 std::fill(tbl_, tbl_ + 256, len); A 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 for (size_t i = 0; i < len; i++) { B 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 C 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 tbl_[str[i]] = len - i; D 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 } E 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 F 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 } 2012/3/31 #x86opti 3 /20
  • 4. Quick Search algorithm(2/2)  Searching phase  simple and fast  see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html const char *find(const char *begin, const char *end) { while (begin <= end - len_) { if (memcmp(str_, begin, len_) == 0) return begin; begin += tbl_[begin[len_]]; } return end; }; 2012/3/31 #x86opti 4 /20
  • 5. Benchmark  2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 2 strstr 0 org Qs substring  Qs(Quick search) is faster for long substring  Remark: assume text does not have ‘¥0’for strstr 2012/3/31 #x86opti 5 /20
  • 6. A little modification of Qs  avoid memcmp const char *find(const char *begin, const char *end) { while (begin <= end - len_) { if (memcmp(str_, begin, len_) == 0) return begin; begin += tbl_[begin[len_]]; } return end; } const char *find(const char *begin, const char *end){ while (begin <= end - len_) { for (size_t i = 0; i < len_; i++) { if (str_[i] != begin[i]) goto NEXT; } return begin; NEXT: begin += tbl_[static_cast<unsigned char>(begin[len_])]; } return end; } 2012/3/31 #x86opti 6 /20
  • 7. Benchmark again  2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 strstr 2 org Qs 0 Qs' substring  modified Qs(Qs’) is more faster  Should we use modified Qs'? 2012/3/31 #x86opti 7 /20
  • 8. strstr on gcc 4.6 with SSE4.2  Xeon X5650 2.67Gz on Linux  strstr with SSE4.2 is faster than Qs’ for substring with length less than 9 byte 5 cycle/Byte to find 4 fast 3 2 1 strstr 0 Qs' substring  Is strstr of gcc is fastest implementation? 2012/3/31 #x86opti 8 /20
  • 9. strstr implementation by strchr  Find candidate of location by strchr at first, and verify the correctness  strchr of gcc with SSE4.2 is fast const char *mystrstr_C(const char *str, const char *key) { size_t len = strlen(key); while (*str) { const char *p = strchr(str, key[0]); if (p == 0) return 0; if (memcmp(p + 1, key + 1, len - 1) == 0) return p; str = p + 1; } return 0; } 2012/3/31 #x86opti 9 /20
  • 10. strstr vs. mystrstr_C  Xeon X5650 2.6GHz + gcc 4.6.1  mystrstr_C is 1.5 ~ 3 times faster than strstr  except for “ko-re-wa”(in UTF-8)  maybe penalty for many bad candidates 10 cycle/Byte to find 8 fast 6 4 strstr 2 Qs' 0 my_strstr_C substring 2012/3/31 #x86opti 10 /20
  • 11. real speed of SSE4.2(pcmpistri)  my_strstr is always faster than Qs’  2 ~ 4 times faster than strstr of gcc 10 8 fast cycle/Byte to find 6 4 strstr Qs' 2 my_strstr_C 0 my_strstr substring 2012/3/31 #x86opti 11 /20
  • 12. Implementation of my_strstr(1/2)  https://github.com/herumi/opti/blob/master/str_util.hpp  written in Xbyak(for my convenience)  Main loop // a : rax(or eax), c : rcx(or ecx) // input a : ptr to text // key : ptr to key // use save_a, save_key, c movdqu(xm0, ptr [key]); // xm0 = *key L(".lp"); pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte] jbe(".headCmp"); add(a, 16); jmp(".lp"); L(".headCmp"); jnc(".notFound"); 2012/3/31 #x86opti 12 /20
  • 13. Implementation of my_strstr(2/2)  Compare tail in“headCmp” ... add(a, c); // get position mov(save_a, a); // save a mov(save_key, key); // save key L(".tailCmp"); movdqu(xm1, ptr [save_key]); pcmpistri(xmm1, ptr [save_a], 12); jno(".next"); js(".found"); // rare case add(save_a, 16); add(save_key, 16); jmp(".tailCmp"); L(".next"); add(a, 1); jmp(".lp"); 2012/3/31 #x86opti 13 /20
  • 14. Pros and Cons of my_strstr  Pros  very fast  Is this implementation with Qs fastest? No, overhead is almost larger(variable address offset)  Cons  access max 16 bytes beyond of the end of text almost no problem except for page boundary allocate memory with margin 4KiB readable page not readable page FF7 FF8 FF9 FFA FFB FFC FFD FFE FFF 000 001 002 003 access pcmpistri violation end of text 2012/3/31 #x86opti 14 /20
  • 15. strstr of Visual Studio 11  almost same speed as my_strstr  of Couse safe to use  i7-2620 3.4GHz + Windows 7 + VS 11beta 8 cycle/Byte to find 6 fast 4 2 strstr Qs' 0 my_strstr substring 2012/3/31 #x86opti 15 /20
  • 16. All benchmarks on i7-2600  find "ko-re-wa" in 33MiB text  the results strongly depends on text and key strstr(before SSE4.2) fast Qs(gcc) Qs'(gcc) strstr(gcc;SSE4.2) strstr(VC;SSE4.2) my_strstr(SSE4.2) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 rate for the timing of Qs(gcc) 2012/3/31 #x86opti 16 /20
  • 17. range version of strstr  strstr is not available for string including‘¥0’  use std::string.find()  but it is not optimized for SSE4.2 naive but fast implementation by C const char *findStr_C(const char *begin, const char *end, const char *key, size_t keySize) { while (begin + keySize <= end) { const char *p = memchr(begin, key[0], end - begin); if (p == 0) break; if (memcmp(p + 1, key + 1, keySize - 1) == 0)return p; begin = p + 1; } return end; }  str_util.hpp provides findStr with SSE4.2  4 ~ 5 times faster than findStr_C on i7-2600 + VC11 2012/3/31 #x86opti 17 /20
  • 18. feature of pcmpestri  very complex mnemonics xmm0 : head of key pcmpestri xmm0, ptr [p], 12 rax : keySize p : pointer to text rcx : pos of key if found rdx : text size CF : if found ZF : end of text SF : end of key OF : all match L(".lp"); pcmpestri(xmm0, ptr [p], 12); lea(p, ptr [p + 16]); lea(d, ptr [d - 16]); do not change carry ja(".lp"); jnc(".notFound"); // compare leading str... 2012/3/31 #x86opti 18 /20
  • 19. Difference between Xeon and i7  main loop of my_strstr L(".lp"); pcmpistri(xmm0, ptr [a], 12); if (isSandyBridge) { lea(a, ptr [a + 16]); ja(".lp"); a little faster on i7 } else { jbe(".headCmp"); add(a, 16); 1.1 times faster on Xeon jmp(".lp"); L(".headCmp"); } jnc(".notFound"); // get position if (isSandyBridge) { lea(a, ptr [a + c - 16]); } else { add(a, c); } 2012/3/31 #x86opti 19 /20
  • 20. other features of str_util.hpp  strchr_any(text, key)[or findChar_any]  returns a pointer to the first occurrence of any character of key int the text // search character position of '?', '#', '$', '!', '/', ':' strchr_any(text,"?#$!/:");  same speed as strchr by using SSE4.2  max length of key is 16  strchr_range(txt, key)[or findChar_range]  returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...  also same speed as strchr and max len(key) = 16 // search character position of [0-9], [a-f], [A-F] strchr_range(text,"09afAF"); 2012/3/31 #x86opti 20 /20