
Quick Search algorithm and strstr


strstr with SSE4.2 is faster than quick search



  1. Quick Search algorithm and strstr
     Cybozu Labs, 2012/3/31
     MITSUNARI Shigeo (@herumi)
     x86/x64 optimization seminar 3 (#x86opti)
  2. 2. Agenda Quick Search vs. strstr of gcc on Core2Duo vs. strstr of gcc on Xeon fast strstr using strchr of gcc vs. my implementation on Xeon  restriction vs. strstr of VC2011 beta on Core i7 feature of pcmpestri range version of strstr2012/3/31 #x86opti 2 /20
  3. Quick Search algorithm (1/2)
     - Simplified and improved Boyer-Moore algorithm
     - initialized table for "this is"

       char   t    h    i    s    ' '  other
       skip   +7   +6   +2   +1   +3   +8

       Shown as a 16x16 grid indexed by byte value, every entry of the
       table is 8 except 0x20 (' ') = 3, 0x68 ('h') = 6, 0x69 ('i') = 2,
       0x73 ('s') = 1 and 0x74 ('t') = 7.
     - How to initialize the table for a given string [str, str + len)

         int tbl_[256];

         void init(const char *str, int len) {
             // default skip is len + 1 (8 for "this is"), matching the table
             std::fill(tbl_, tbl_ + 256, len + 1);
             for (int i = 0; i < len; i++) {
                 // index as unsigned char so bytes >= 0x80 stay in range
                 tbl_[static_cast<unsigned char>(str[i])] = len - i;
             }
         }
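As a quick check of the table above, here is a minimal sketch that builds the skip table for "this is" and reads back the skips listed on the slide. The `QsTable` struct, its `shift` accessor and the `makeTable` helper are names made up for this example; only `init` follows the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical wrapper around the slide's tbl_/init pair.
struct QsTable {
    int tbl[256];

    void init(const char *str, int len) {
        assert(len >= 0);
        // Bytes that do not occur in the key skip the whole window plus one.
        std::fill(tbl, tbl + 256, len + 1);
        for (int i = 0; i < len; i++) {
            // Later occurrences overwrite earlier ones, so the last one wins.
            tbl[static_cast<unsigned char>(str[i])] = len - i;
        }
    }

    int shift(char c) const { return tbl[static_cast<unsigned char>(c)]; }
};

// Convenience helper (an assumption of this example): table for a C string.
inline QsTable makeTable(const char *key) {
    QsTable t;
    t.init(key, static_cast<int>(std::strlen(key)));
    return t;
}
```

With `QsTable t = makeTable("this is");`, `t.shift('i')` is 2 because the last 'i' sits two bytes before the end of the key, and any byte that is not in the key gives 8.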
  4. Quick Search algorithm (2/2)
     - Searching phase
       - simple and fast
       - see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

         const char *find(const char *begin, const char *end) {
             while (begin <= end - len_) {
                 if (memcmp(str_, begin, len_) == 0) return begin;
                 // begin[len_] is the byte just after the window; for the
                 // last window it is *end, so that byte must be readable
                 begin += tbl_[static_cast<unsigned char>(begin[len_])];
             }
             return end;
         }
  5. 5. Benchmark 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 2 strstr 0 org Qs substring  Qs(Quick search) is faster for long substring  Remark: assume text does not have ‘¥0’for strstr2012/3/31 #x86opti 5 /20
  6. A little modification of Qs
     - avoid memcmp

       before:

         const char *find(const char *begin, const char *end) {
             while (begin <= end - len_) {
                 if (memcmp(str_, begin, len_) == 0) return begin;
                 begin += tbl_[static_cast<unsigned char>(begin[len_])];
             }
             return end;
         }

       after:

         const char *find(const char *begin, const char *end) {
             while (begin <= end - len_) {
                 // compare inline instead of calling memcmp
                 for (size_t i = 0; i < len_; i++) {
                     if (str_[i] != begin[i]) goto NEXT;
                 }
                 return begin;
             NEXT:
                 begin += tbl_[static_cast<unsigned char>(begin[len_])];
             }
             return end;
         }
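The two slides on Quick Search can be put together into one self-contained searcher. The following is a minimal sketch rather than the benchmarked code: the class name `QuickSearch` is made up, the member names follow the slides, and the extra check before the shift is an assumption added here so that the last window never reads the byte at `end`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical class; str_, len_ and tbl_ mirror the names on the slides.
class QuickSearch {
    const char *str_;
    std::size_t len_;
    std::size_t tbl_[256];

public:
    // key must stay alive for as long as the searcher is used.
    QuickSearch(const char *key, std::size_t len) : str_(key), len_(len) {
        assert(len > 0);
        std::fill(tbl_, tbl_ + 256, len + 1);
        for (std::size_t i = 0; i < len; i++) {
            tbl_[static_cast<unsigned char>(key[i])] = len - i;
        }
    }

    // Returns the first match in [begin, end), or end if there is none.
    const char *find(const char *begin, const char *end) const {
        while (static_cast<std::size_t>(end - begin) >= len_) {
            // Inline comparison, as in the modified version (no memcmp).
            std::size_t i = 0;
            while (i < len_ && str_[i] == begin[i]) i++;
            if (i == len_) return begin;
            // The last window has no following byte inside [begin, end).
            if (static_cast<std::size_t>(end - begin) == len_) break;
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }
};
```

For the text "that this is it" and the key "this is", the first window mismatches, the byte after it is 'i' (skip 2), the next byte after the window is ' ' (skip 3), and the third window matches at offset 5.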
  7. 7. Benchmark again 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 strstr 2 org Qs 0 Qs substring  modified Qs(Qs’) is more faster  Should we use modified Qs?2012/3/31 #x86opti 7 /20
  8. strstr on gcc 4.6 with SSE4.2
     - Xeon X5650 2.67GHz on Linux
     - strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes
     [Chart: cycles/byte to find each substring (lower is faster); series: strstr, Qs]
     - Is strstr of gcc the fastest implementation?
  9. strstr implementation by strchr
     - Find a candidate location with strchr first, then verify it
       - strchr of gcc with SSE4.2 is fast

         const char *mystrstr_C(const char *str, const char *key) {
             size_t len = strlen(key);
             if (len == 0) return str; // empty key: match at the start, like strstr
             while (*str) {
                 const char *p = strchr(str, key[0]);
                 if (p == 0) return 0;
                 // memcmp may read len - 1 bytes from p + 1, so the text needs
                 // that much readable margin after its terminator
                 if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
                 str = p + 1;
             }
             return 0;
         }
  10. strstr vs. mystrstr_C
      - Xeon X5650 2.6GHz + gcc 4.6.1
      - mystrstr_C is 1.5 ~ 3 times faster than strstr
        - except for "ko-re-wa" (in UTF-8)
        - maybe a penalty for many bad candidates
      [Chart: cycles/byte to find each substring (lower is faster); series: strstr, Qs, my_strstr_C]
  11. real speed of SSE4.2 (pcmpistri)
      - my_strstr is always faster than Qs'
      - 2 ~ 4 times faster than strstr of gcc
      [Chart: cycles/byte to find each substring (lower is faster); series: strstr, Qs, my_strstr_C, my_strstr]
  12. Implementation of my_strstr (1/2)
      - https://github.com/herumi/opti/blob/master/str_util.hpp
      - written in Xbyak (for my convenience)
      - Main loop

          // a : rax(or eax), c : rcx(or ecx)
          // input a : ptr to text
          //       key : ptr to key
          // use save_a, save_key, c
              movdqu(xm0, ptr [key]); // xm0 = *key
          L(".lp");
              pcmpistri(xmm0, ptr [a], 12);
              // 12(1100b) = [equal ordered:unsigned:byte]
              jbe(".headCmp"); // CF or ZF set: candidate found or end of text
              add(a, 16);
              jmp(".lp");
          L(".headCmp");
              jnc(".notFound"); // CF clear: the text ended without a candidate
  13. Implementation of my_strstr (2/2)
      - Compare the tail in "headCmp"

              ...
              add(a, c); // get position
              mov(save_a, a); // save a
              mov(save_key, key); // save key
          L(".tailCmp");
              movdqu(xm1, ptr [save_key]);
              pcmpistri(xmm1, ptr [save_a], 12);
              jno(".next");
              js(".found");
              // rare case
              add(save_a, 16);
              add(save_key, 16);
              jmp(".tailCmp");
          L(".next");
              add(a, 1);
              jmp(".lp");
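For readers who do not use Xbyak, the same idea can be approximated with compiler intrinsics. This is only a sketch under assumptions, not the code from str_util.hpp: the function names are made up, the candidate is verified with `strncmp` instead of the second pcmpistri loop, and the program falls back to `std::strstr` when SSE4.2 is not available. It keeps the restriction that up to 16 bytes are loaded at a time, so the text needs 15 readable bytes after its terminator and the key buffer must be at least 16 bytes long.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h>
#define STRSTR_SKETCH_HAS_SSE42 1

// Hypothetical helper. Mode 12 is "equal ordered" on unsigned bytes.
__attribute__((target("sse4.2")))
static const char *strstrSse42(const char *text, const char *key, std::size_t keyLen) {
    const __m128i k = _mm_loadu_si128(reinterpret_cast<const __m128i *>(key));
    for (;;) {
        const __m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i *>(text));
        // Index of the first (possibly partial) match in this block, or 16.
        const int idx = _mm_cmpistri(k, t, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED);
        if (idx < 16) {
            const char *p = text + idx;
            // Verify the whole key; strncmp stops at the text terminator.
            if (std::strncmp(p, key, keyLen) == 0) return p;
            text = p + 1;
        } else if (_mm_cmpistrz(k, t, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED)) {
            return nullptr; // the block contains the terminator: end of text
        } else {
            text += 16;
        }
    }
}
#endif

// Returns the first occurrence of key in text, or nullptr.
const char *strstrSketch(const char *text, const char *key) {
    const std::size_t keyLen = std::strlen(key);
    if (keyLen == 0) return text;
#ifdef STRSTR_SKETCH_HAS_SSE42
    if (__builtin_cpu_supports("sse4.2")) return strstrSse42(text, key, keyLen);
#endif
    return std::strstr(text, key); // portable fallback
}
```

A call such as `strstrSketch(text, key)` with `char text[64] = "..."` and `char key[16] = "..."` satisfies the margin requirement because both arrays are zero-padded.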
  14. Pros and Cons of my_strstr
      Pros
      - very fast
      - Would combining this implementation with Qs be the fastest?
        No, the overhead is larger in most cases (variable address offset).
      Cons
      - may access up to 16 bytes beyond the end of the text
        - almost never a problem, except at a page boundary
        - allocate memory with a margin
      [Diagram: the text ends near the end of a readable 4KiB page (offsets FF7 to FFF); the 16-byte pcmpistri access continues into the next, unreadable page (offsets 000 to 003) and causes an access violation.]
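The page-boundary case in the diagram can be detected with simple address arithmetic. The helpers below are a hypothetical sketch (neither name comes from str_util.hpp) and assume a 4KiB page size by default; a crossing is only dangerous when the next page is not readable, so the check is conservative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if reading n bytes starting at addr touches more than one page.
// pageSize must be a power of two.
inline bool crossesPage(std::uintptr_t addr, std::size_t n, std::size_t pageSize = 4096) {
    assert(n > 0 && (pageSize & (pageSize - 1)) == 0);
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(pageSize - 1);
    const std::uintptr_t firstPage = addr & mask;
    const std::uintptr_t lastPage = (addr + n - 1) & mask;
    return firstPage != lastPage;
}

// True if a 16-byte load at p (as done by pcmpistri) crosses a page boundary.
inline bool loadCrossesPage(const void *p) {
    return crossesPage(reinterpret_cast<std::uintptr_t>(p), 16);
}
```

A load at page offset FF0 stays inside the page, while a load at FF7, as in the diagram, reaches offset 006 of the following page.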
  15. strstr of Visual Studio 11
      - almost the same speed as my_strstr
      - of course safe to use
      - i7-2620 3.4GHz + Windows 7 + VS 11beta
      [Chart: cycles/byte to find each substring (lower is faster); series: strstr, Qs, my_strstr]
  16. All benchmarks on i7-2600
      - find "ko-re-wa" in 33MiB text
      - the results strongly depend on the text and the key
      [Bar chart: ratio to the timing of Qs(gcc), axis 0 to 14; bars: strstr(before SSE4.2), Qs(gcc), Qs(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2)]
  17. range version of strstr
      - strstr is not available for strings including '\0'
      - use std::string::find()
        - but it is not optimized for SSE4.2
      - naive but fast implementation in C

          const char *findStr_C(const char *begin, const char *end,
                                const char *key, size_t keySize) {
              if (keySize == 0) return begin;
              while (begin + keySize <= end) {
                  // look for key[0] only where the whole key still fits
                  const char *p = (const char *)memchr(begin, key[0], end - begin - keySize + 1);
                  if (p == 0) break;
                  if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
                  begin = p + 1;
              }
              return end;
          }

      - str_util.hpp provides findStr with SSE4.2
        - 4 ~ 5 times faster than findStr_C on i7-2600 + VC11
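To illustrate why a range version is needed, here is a small self-contained variant of the naive search together with a text that contains '\0'. The name `findStrSketch` is made up for this example, and the function assumes `begin <= end`; it is not the SSE4.2 `findStr` from str_util.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns the first occurrence of [key, key + keySize) in [begin, end),
// or end if there is none. Bytes equal to '\0' are treated like any other.
const char *findStrSketch(const char *begin, const char *end,
                          const char *key, std::size_t keySize) {
    if (keySize == 0) return begin;
    while (static_cast<std::size_t>(end - begin) >= keySize) {
        // Only positions where the whole key still fits are candidates.
        const std::size_t span = static_cast<std::size_t>(end - begin) - keySize + 1;
        const char *p = static_cast<const char *>(std::memchr(begin, key[0], span));
        if (p == nullptr) break;
        if (std::memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
        begin = p + 1;
    }
    return end;
}
```

With `const char text[] = "ab\0cd\0cde";`, `std::strstr(text, "cd")` stops at the first '\0' and finds nothing, while `findStrSketch(text, text + 9, "\0cde", 4)` returns `text + 5`.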
  18. feature of pcmpestri
      - very complex mnemonics

          pcmpestri xmm0, ptr [p], 12

          xmm0 : head of key
          rax  : keySize
          p    : pointer to text
          rdx  : text size
          rcx  : pos of key if found
          CF   : if found
          ZF   : end of text
          SF   : end of key
          OF   : all match

          L(".lp");
              pcmpestri(xmm0, ptr [p], 12);
              lea(p, ptr [p + 16]); // lea does not change carry
              lea(d, ptr [d - 16]); // lea does not change carry
              ja(".lp");
              jnc(".notFound");
              // compare leading str...
  19. Difference between Xeon and i7
      - main loop of my_strstr

          L(".lp");
              pcmpistri(xmm0, ptr [a], 12);
              if (isSandyBridge) {
                  // a little faster on i7
                  lea(a, ptr [a + 16]);
                  ja(".lp");
              } else {
                  // 1.1 times faster on Xeon
                  jbe(".headCmp");
                  add(a, 16);
                  jmp(".lp");
          L(".headCmp");
              }
              jnc(".notFound");
              // get position
              if (isSandyBridge) {
                  lea(a, ptr [a + c - 16]);
              } else {
                  add(a, c);
              }
  20. other features of str_util.hpp
      - strchr_any(text, key) [or findChar_any]
        - returns a pointer to the first occurrence in the text of any character of key

            // search character position of ?, #, $, !, /, :
            strchr_any(text, "?#$!/:");

        - same speed as strchr by using SSE4.2
        - max length of key is 16
      - strchr_range(txt, key) [or findChar_range]
        - returns a pointer to the first occurrence of a character in the ranges [key[0], key[1]], [key[2], key[3]], ...
        - also the same speed as strchr, and max len(key) = 16

            // search character position of [0-9], [a-f], [A-F]
            strchr_range(text, "09afAF");
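The semantics of these two functions can be spelled out with a plain scalar reference. This sketch is not the SSE4.2 code from str_util.hpp: the `Ref` names are made up, and returning `nullptr` when nothing matches is an assumption, since the slide does not state the not-found result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// First character of text that equals any character of key, or nullptr.
const char *strchrAnyRef(const char *text, const char *key) {
    for (; *text; text++) {
        if (std::strchr(key, *text) != nullptr) return text;
    }
    return nullptr;
}

// First character of text inside one of the inclusive ranges
// [key[0], key[1]], [key[2], key[3]], ..., or nullptr.
const char *strchrRangeRef(const char *text, const char *key) {
    const std::size_t n = std::strlen(key);
    assert(n % 2 == 0 && n <= 16); // pairs of bounds, at most 16 bytes
    for (; *text; text++) {
        const unsigned char c = static_cast<unsigned char>(*text);
        for (std::size_t i = 0; i + 1 < n; i += 2) {
            const unsigned char lo = static_cast<unsigned char>(key[i]);
            const unsigned char hi = static_cast<unsigned char>(key[i + 1]);
            if (lo <= c && c <= hi) return text;
        }
    }
    return nullptr;
}
```

For example, `strchrAnyRef("http://example.com/?q=1", "?#$!/:")` points at the ':' at offset 4, and `strchrRangeRef("xyz-7f", "09afAF")` points at the '7' at offset 4.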
