
Quick Search algorithm and strstr

strstr with SSE4.2 is faster than quick search

Published in: Technology
Transcript

  • 1. Quick Search algorithm and strstr
      Cybozu Labs, MITSUNARI Shigeo (@herumi)
      x86/x64 optimization seminar 3 (#x86opti), 2012/3/31
  • 2. Agenda
      • Quick Search
      • vs. strstr of gcc on Core2Duo
      • vs. strstr of gcc on Xeon
      • fast strstr using strchr of gcc
      • vs. my implementation on Xeon
          • restriction
      • vs. strstr of VC2011 beta on Core i7
      • feature of pcmpestri
      • range version of strstr
  • 3. Quick Search algorithm (1/2)
      • Simplified and improved Boyer-Moore algorithm
      • Initialized table for "this is":

          char   't'  'h'  'i'  's'  ' '  other
          skip   +7   +6   +2   +1   +3   +8

        In the full 256-entry table (indexed by byte value), every entry is 8 except 0x20 (' ') = 3, 0x68 ('h') = 6, 0x69 ('i') = 2, 0x73 ('s') = 1 and 0x74 ('t') = 7.
      • How to initialize the table for a given string [str, str + len):

          int tbl_[256];
          void init(const char *str, int len) {
            // default shift for bytes that do not occur in the string
            std::fill(tbl_, tbl_ + 256, len + 1);
            for (int i = 0; i < len; i++) {
              // the last occurrence of each byte wins
              tbl_[static_cast<unsigned char>(str[i])] = len - i;
            }
          }
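
The table construction on this slide can be checked with a small stand-alone sketch. The function name `make_qs_table` is hypothetical (it is not from the slides), and the default shift of `len + 1` is taken from the table shown for "this is", where every other byte maps to 8.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: build the Quick Search shift table for key[0, len).
// Bytes that do not occur in the key get the default shift len + 1.
std::array<int, 256> make_qs_table(const char* key, std::size_t len) {
    std::array<int, 256> tbl;
    tbl.fill(static_cast<int>(len) + 1);
    for (std::size_t i = 0; i < len; i++) {
        // Later occurrences overwrite earlier ones, so the last one wins.
        tbl[static_cast<unsigned char>(key[i])] = static_cast<int>(len - i);
    }
    return tbl;
}
```

For the key "this is" (7 bytes) this gives 7, 6, 2, 1 and 3 for 't', 'h', 'i', 's' and ' ', and 8 for every other byte.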
  • 4. Quick Search algorithm (2/2)
      • Searching phase
          • simple and fast
          • see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

          const char *find(const char *begin, const char *end) {
            while (begin <= end - len_) {
              if (memcmp(str_, begin, len_) == 0) return begin;
              // shift by the table entry of the byte just after the window
              begin += tbl_[static_cast<unsigned char>(begin[len_])];
            }
            return end;
          }
  • 5. Benchmark
      • 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
      • 33MB UTF-8 text
      • [Chart: cycles/Byte to find (axis 0 to 10, lower is faster) for each substring; series: strstr, org Qs]
      • Qs (Quick Search) is faster for long substrings
      • Remark: assume the text does not contain '\0' for strstr
  • 6. A little modification of Qs
      • avoid memcmp
      • before:

          const char *find(const char *begin, const char *end) {
            while (begin <= end - len_) {
              if (memcmp(str_, begin, len_) == 0) return begin;
              begin += tbl_[static_cast<unsigned char>(begin[len_])];
            }
            return end;
          }

      • after:

          const char *find(const char *begin, const char *end) {
            while (begin <= end - len_) {
              // compare byte by byte instead of calling memcmp
              for (size_t i = 0; i < len_; i++) {
                if (str_[i] != begin[i]) goto NEXT;
              }
              return begin;
            NEXT:
              begin += tbl_[static_cast<unsigned char>(begin[len_])];
            }
            return end;
          }
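
Both versions of `find` read `begin[len_]` even when the window touches `end`, which is harmless for a NUL-terminated string but is an out-of-range read for an arbitrary [begin, end) buffer. The following is a minimal bounds-checked sketch of the same search; the name `qs_find` and the offset-or-minus-one return convention are assumptions made for illustration, not part of the slides.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: Quick Search over text[0, textLen).
// Returns the offset of the first match, or -1 if there is none.
std::ptrdiff_t qs_find(const char* text, std::size_t textLen,
                       const char* key, std::size_t keyLen) {
    if (keyLen == 0) return 0;
    if (keyLen > textLen) return -1;
    // Shift table: keyLen + 1 by default, keyLen - i for the last occurrence of key[i].
    std::vector<std::size_t> tbl(256, keyLen + 1);
    for (std::size_t i = 0; i < keyLen; i++) {
        tbl[static_cast<unsigned char>(key[i])] = keyLen - i;
    }
    std::size_t pos = 0;
    while (pos + keyLen <= textLen) {
        if (std::memcmp(text + pos, key, keyLen) == 0) {
            return static_cast<std::ptrdiff_t>(pos);
        }
        // The last window has no following byte to look up, so stop here.
        if (pos + keyLen == textLen) break;
        pos += tbl[static_cast<unsigned char>(text[pos + keyLen])];
    }
    return -1;
}
```

Searching "this is a test" for "is a" takes two shifts (by 2, then by 3) before the match at offset 5 is found.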
  • 7. Benchmark again
      • 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
      • 33MB UTF-8 text
      • [Chart: cycles/Byte to find (axis 0 to 10, lower is faster) for each substring; series: strstr, org Qs, Qs]
      • The modified Qs (Qs') is even faster
      • Should we use the modified Qs?
  • 8. strstr on gcc 4.6 with SSE4.2
      • Xeon X5650 2.67GHz on Linux
      • strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes
      • [Chart: cycles/Byte to find (axis 0 to 5, lower is faster) for each substring; series: strstr, Qs]
      • Is strstr of gcc the fastest implementation?
  • 9. strstr implementation by strchr
      • First find a candidate location with strchr, then verify it
          • strchr of gcc with SSE4.2 is fast

          // assumes key is not empty
          const char *mystrstr_C(const char *str, const char *key) {
            size_t len = strlen(key);
            while (*str) {
              // candidate: next occurrence of the first key byte
              const char *p = strchr(str, key[0]);
              if (p == 0) return 0;
              // verify the rest of the key
              if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
              str = p + 1;
            }
            return 0;
          }
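
In `mystrstr_C`, `memcmp` is asked to compare `len - 1` bytes even when fewer bytes remain before the text's terminator, and an empty key makes `len - 1` wrap around. A more defensive sketch of the same strchr-then-verify idea is shown below; the name `strstr_by_strchr` is hypothetical, and swapping `memcmp` for `strncmp` is a change from the slide that may cost some speed.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical variant of the strchr-based search.
// Returns a pointer to the first match, or nullptr if there is none.
const char* strstr_by_strchr(const char* str, const char* key) {
    const std::size_t len = std::strlen(key);
    if (len == 0) return str;  // same convention as strstr for an empty key
    for (;;) {
        // Candidate: next occurrence of the first key byte.
        const char* p = std::strchr(str, key[0]);
        if (p == nullptr) return nullptr;
        // strncmp stops at the text's terminator, so it never reads past it.
        if (std::strncmp(p + 1, key + 1, len - 1) == 0) return p;
        str = p + 1;
    }
}
```

The benchmark caveat from the slides still applies: a key whose first byte is very common in the text produces many failed candidates.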
  • 10. strstr vs. mystrstr_C
      • Xeon X5650 2.6GHz + gcc 4.6.1
      • mystrstr_C is 1.5 ~ 3 times faster than strstr
          • except for "ko-re-wa" (in UTF-8)
          • maybe a penalty for many bad candidates
      • [Chart: cycles/Byte to find (axis 0 to 10, lower is faster) for each substring; series: strstr, Qs, my_strstr_C]
  • 11. real speed of SSE4.2 (pcmpistri)
      • my_strstr is always faster than Qs'
          • 2 ~ 4 times faster than strstr of gcc
      • [Chart: cycles/Byte to find (axis 0 to 10, lower is faster) for each substring; series: strstr, Qs, my_strstr_C, my_strstr]
  • 12. Implementation of my_strstr (1/2)
      • https://github.com/herumi/opti/blob/master/str_util.hpp
          • written in Xbyak (for my convenience)
      • Main loop

          // a : rax(or eax), c : rcx(or ecx)
          // input a : ptr to text
          //       key : ptr to key
          // use save_a, save_key, c
          movdqu(xm0, ptr [key]); // xm0 = *key
          L(".lp");
          pcmpistri(xmm0, ptr [a], 12);
          // 12(1100b) = [equal ordered:unsigned:byte]
          jbe(".headCmp");
          add(a, 16);
          jmp(".lp");
          L(".headCmp");
          jnc(".notFound");
  • 13. Implementation of my_strstr (2/2)
      • Compare tail in "headCmp"

          ...
          add(a, c); // get position
          mov(save_a, a); // save a
          mov(save_key, key); // save key
          L(".tailCmp");
          movdqu(xm1, ptr [save_key]);
          pcmpistri(xmm1, ptr [save_a], 12);
          jno(".next");
          js(".found");
          // rare case
          add(save_a, 16);
          add(save_key, 16);
          jmp(".tailCmp");
          L(".next");
          add(a, 1);
          jmp(".lp");
  • 14. Pros and Cons of my_strstr
      • Pros
          • very fast
          • Would combining this implementation with Qs be the fastest? No, the overhead is usually larger (variable address offset)
      • Cons
          • may read up to 16 bytes beyond the end of the text
          • almost never a problem, except at a page boundary
          • allocate memory with a margin
      • [Diagram: the text ends near the end of a readable 4KiB page (offsets FF7 to FFF); the 16-byte pcmpistri access continues into the following non-readable page (offsets 000 to 003) and causes an access violation]
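
The page-boundary restriction can be expressed as a simple address check. The sketch below is only an illustration: the name `read_stays_in_page` is hypothetical, the 4 KiB page size is taken from the slide's diagram, and the page size is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if reading n bytes starting at p stays inside
// the page that contains p. pageSize must be a power of two.
bool read_stays_in_page(const void* p, std::size_t n = 16,
                        std::size_t pageSize = 4096) {
    const std::uintptr_t offset =
        reinterpret_cast<std::uintptr_t>(p) & (pageSize - 1);
    return offset + n <= pageSize;
}
```

With the offsets from the diagram, a 16-byte read starting at page offset FF7 crosses into the next page, while one starting at FF0 does not. Allocating the text with at least 16 bytes of readable margin, as the slide suggests, avoids the need for such a check.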
  • 15. strstr of Visual Studio 11
      • almost the same speed as my_strstr
          • and, of course, safe to use
      • i7-2620 3.4GHz + Windows 7 + VS 11 beta
      • [Chart: cycles/Byte to find (axis 0 to 8, lower is faster) for each substring; series: strstr, Qs, my_strstr]
  • 16. All benchmarks on i7-2600
      • find "ko-re-wa" in 33MiB text
          • the results strongly depend on the text and the key
      • [Bar chart: time relative to the timing of Qs(gcc) (axis 0 to 14, lower is faster); bars: strstr(before SSE4.2), Qs(gcc), Qs(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2)]
  • 17. range version of strstr
      • strstr cannot be used for strings that contain '\0'
      • use std::string::find()
          • but it is not optimized for SSE4.2
      • naive but fast implementation in C

          // assumes keySize >= 1
          const char *findStr_C(const char *begin, const char *end, const char *key, size_t keySize) {
            while (begin + keySize <= end) {
              // look for key[0] only where the whole key still fits before end
              const char *p = (const char *)memchr(begin, key[0], end - begin - keySize + 1);
              if (p == 0) break;
              if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
              begin = p + 1;
            }
            return end;
          }

      • str_util.hpp provides findStr with SSE4.2
          • 4 ~ 5 times faster than findStr_C on i7-2600 + VC11
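
To show why a range version is needed, here is a C++ sketch of the same memchr-then-memcmp idea, usable on a buffer with embedded '\0' bytes. The name `find_range` is hypothetical. As on the slide, it returns `end` when nothing is found; the empty-key behaviour (return `begin`) is an assumption.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: search [begin, end) for key[0, keySize).
// Returns a pointer to the first match, or end if there is none.
const char* find_range(const char* begin, const char* end,
                       const char* key, std::size_t keySize) {
    if (keySize == 0) return begin;
    while (static_cast<std::size_t>(end - begin) >= keySize) {
        // Only positions where the whole key still fits are candidates.
        const std::size_t span =
            static_cast<std::size_t>(end - begin) - keySize + 1;
        const char* p =
            static_cast<const char*>(std::memchr(begin, key[0], span));
        if (p == nullptr) break;
        if (std::memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
        begin = p + 1;
    }
    return end;
}
```

For the 10-byte buffer `"ab\0cd\0key!"`, `strstr` stops at the first '\0' and reports no match for "key", while `find_range` finds it at offset 6.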
  • 18. feature of pcmpestri
      • very complex mnemonic

          pcmpestri xmm0, ptr [p], 12

          xmm0 : head of key
          rax  : keySize
          p    : pointer to text
          rdx  : text size
          rcx  : pos of key if found
          CF   : if found
          ZF   : end of text
          SF   : end of key
          OF   : all match

          L(".lp");
          pcmpestri(xmm0, ptr [p], 12);
          lea(p, ptr [p + 16]); // lea does not change carry
          lea(d, ptr [d - 16]); // lea does not change carry
          ja(".lp");
          jnc(".notFound");
          // compare leading str...
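
Because the instruction is hard to read from its mnemonic alone, a scalar model may help. The sketch below models what `pcmpestri` with immediate 12 (equal ordered, unsigned bytes, least significant index) leaves in rcx, following my reading of Intel's instruction reference; treat it as an assumption to check against the manual, not as a specification. The function name is hypothetical and lengths are assumed to be non-negative.

```cpp
#include <algorithm>

// Hypothetical scalar model of pcmpestri xmm_key, [text], 12.
// Returns the index the instruction would leave in ecx: the first position j
// in the 16-byte text block where the key matches as far as the block
// reaches, or 16 if there is no such position.
int model_pcmpestri_equal_ordered(const char* key, int keyLen,
                                  const char* text, int textLen) {
    keyLen = std::min(keyLen, 16);
    textLen = std::min(textLen, 16);
    for (int j = 0; j < 16; j++) {
        bool match = true;
        for (int i = 0; i + j < 16 && match; i++) {
            if (i >= keyLen) break;  // key exhausted: the rest counts as equal
            if (i + j >= textLen) {
                match = false;       // key byte left but no valid text byte
            } else if (key[i] != text[i + j]) {
                match = false;
            }
        }
        if (match) return j;
    }
    return 16;
}
```

A return value below 16 corresponds to "CF : if found" on the slide. Note that a key starting in the last bytes of a full 16-byte block matches only partially, which is consistent with the separate tail comparison in the my_strstr implementation.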
  • 19. Difference between Xeon and i7
      • main loop of my_strstr

          L(".lp");
          pcmpistri(xmm0, ptr [a], 12);
          if (isSandyBridge) {
            // a little faster on i7
            lea(a, ptr [a + 16]);
            ja(".lp");
          } else {
            // 1.1 times faster on Xeon
            jbe(".headCmp");
            add(a, 16);
            jmp(".lp");
            L(".headCmp");
          }
          jnc(".notFound");
          // get position
          if (isSandyBridge) {
            lea(a, ptr [a + c - 16]);
          } else {
            add(a, c);
          }
  • 20. other features of str_util.hpp
      • strchr_any(text, key) [or findChar_any]
          • returns a pointer to the first occurrence in the text of any character of key

          // search character position of ?, #, $, !, /, :
          strchr_any(text, "?#$!/:");

          • same speed as strchr by using SSE4.2
          • max length of key is 16
      • strchr_range(txt, key) [or findChar_range]
          • returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...
          • also same speed as strchr, and max len(key) = 16

          // search character position of [0-9], [a-f], [A-F]
          strchr_range(text, "09afAF");
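
The intended results of these two functions can be illustrated with plain scalar reference versions. These are not the SSE4.2 implementations from str_util.hpp; the `ref_` names are hypothetical, and returning `nullptr` when nothing matches is an assumption modelled on strchr, since the slide does not state the not-found result.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical scalar reference: first byte of text that occurs in key.
const char* ref_strchr_any(const char* text, const char* key) {
    for (; *text; ++text) {
        if (std::strchr(key, *text) != nullptr) return text;
    }
    return nullptr;
}

// Hypothetical scalar reference: first byte of text that lies in one of the
// inclusive ranges [key[0], key[1]], [key[2], key[3]], ...
const char* ref_strchr_range(const char* text, const char* key) {
    const std::size_t n = std::strlen(key);
    for (; *text; ++text) {
        const unsigned char c = static_cast<unsigned char>(*text);
        for (std::size_t i = 0; i + 1 < n; i += 2) {
            const unsigned char lo = static_cast<unsigned char>(key[i]);
            const unsigned char hi = static_cast<unsigned char>(key[i + 1]);
            if (lo <= c && c <= hi) return text;
        }
    }
    return nullptr;
}
```

For example, `ref_strchr_any("http://a/b?x", "?#$!/:")` points at the ':' at offset 4, and `ref_strchr_range("xyz-7f", "09afAF")` points at the '7' at offset 4.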
