Bouma2 talk


  1. 1. A High-Performance Input-Aware Multiple String-Match Algorithm (Erez Buchnik)
  2. 2. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  3. 3. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  4. 4. The Multiple String-Match Problem • Goal: Given a set of strings and an input text, find all occurrences of any of the strings in the text • Input: Set of strings L and input text M • Output: Offsets 1 ≤ i ≤ |M| at which a substring of M matches any of the strings in L • Uses: AV, IPS, DPI, DNA search, etc.
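The problem statement above can be pinned down with a brute-force reference matcher. This is only a baseline sketch for checking faster matchers against, not part of Bouma2; the function name `naive_multi_match` is hypothetical, and offsets are 0-based here whereas the slide counts from 1.

```python
def naive_multi_match(strings, text):
    """Return sorted (offset, string) pairs for every occurrence in text.

    Brute force: try every string at every offset of the text.
    Offsets are 0-based (the slide uses 1-based offsets).
    """
    hits = []
    for s in set(strings):
        for i in range(len(text) - len(s) + 1):
            if text.startswith(s, i):
                hits.append((i, s))
    return sorted(hits)
```

For example, `naive_multi_match(["bits", "cooks", "boat", "book"], "rabbits hate cooks")` reports `bits` at offset 3 and `cooks` at offset 13.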
  5. 5. The Multiple String-Match Problem - References • Aho-Corasick ’75 • Commentz-Walter ’79 • Rabin-Karp ’87 • Wu-Manber ’94 • Muth-Manber ’96 • Hopcroft-Motwani-Ullman ’00 • Dori-Landau ’06
  6. 6. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  7. 7. Stateful Approach (e.g. Aho-Corasick) • One state transition per symbol • Linear in the length of the input • Large automata cause cache misses and degrade performance
  8. 8. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  9. 9. Guidelines • INTUITIVE: Search for ‘Hints’ of a Match Before the Full Match • REALISTIC: Use Prior Knowledge of Expected Input • SIMPLE: Trivial Match Process
  10. 10. Bouma2: Motif-Based String Match • Set of strings: bore, core, trek, bits, corridor, boat, book, cooks • Set of selected 2-symbols long substrings: re, ek, bi, at, ok, or • Preprocessing: Map every string to its own substring: Motif • Q1: How to select motifs?
  11. 11. Bouma2: Motif-Based String Match (cont.) • [Figure: scan of the input “rabbits hate cooks”; the full-match attempts for boat and book are labeled “No match”, those for bits and cooks are labeled “Match”] • Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences • Q2: How to resolve collisions?
  12. 12. Capturing all Occurrences • [Figure: input “habits of rabbits”; both occurrences of bits are labeled “Match”] • Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
  13. 13. Upgrade #1: 2-Symbol Strides • [Figure: input “habits of rabbits”; both occurrences of bits are labeled “Match”] • We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
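The effect of mapping each string twice can be sketched as follows. This is a minimal illustration, not the talk's implementation: the name `find_all`, the `parities` parameter, and the choice of motifs (simply the substrings at offsets 0 and 1 of each string) are assumptions made for the example, and strings are assumed to be at least 3 symbols long.

```python
def find_all(text, words, parities=(0, 1)):
    """Scan text in 2-symbol strides and verify around every motif hit.

    Assumed motif choice: the 2-symbol substring at offset 0 (even) and,
    if 1 is in parities, the one at offset 1 (odd) of each word.
    """
    table = {}  # motif -> [(word, offset of the motif inside the word)]
    for w in words:
        if len(w) < 3:
            raise ValueError("sketch assumes strings of at least 3 symbols")
        for off in parities:
            table.setdefault(w[off:off + 2], []).append((w, off))
    matches = set()
    for pos in range(0, len(text) - 1, 2):  # one pass, even offsets only
        for w, off in table.get(text[pos:pos + 2], ()):
            start = pos - off
            if start >= 0 and text.startswith(w, start):
                matches.add((start, w))
    return sorted(matches)
```

With only the even-offset motif, `find_all(" habits of rabbits", ["bits"], parities=(0,))` finds just the occurrence that starts at an even offset (14). With both motifs, the same single pass also finds the occurrence at the odd offset 3, because its odd-offset motif lands on an even input offset.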
  14. 14. Upgrade #2: Fast-Path / Slow-Path • [Figure: input “habits of rabbits” with offsets 4 and 14 marked] • Fast-Path: Stateless; “Monolithic” (zero branches); Cache-Aware (small direct-table); SIMPLE…
  15. 15. Upgrade #2: Fast-Path / Slow-Path • [Figure: input “habits of rabbits” with offsets 4 and 14 marked; both occurrences of bits are labeled “Match”] • Slow-Path: Memory-Efficient (pointers to original strings for comparison); “Localized” (separate structure for every motif)
  16. 16. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  17. 17. Bouma2 vs. Aho-Corasick • n – length of input • S – no. of string-matches in n • m – no. of motif-matches in n • l – length of the longest string • Match Complexities: Aho-Corasick: $O(n + S)$; Bouma2: $O(\frac{n}{2} + m \cdot l)$
  18. 18. Bouma2 vs. Aho-Corasick (Speed) • [Chart: Bouma2 fast-path, Bouma2 slow-path (sub-optimal), Aho-Corasick] • In practice, Bouma2 is usually at least twice as fast as Aho-Corasick • Fast-path alone is 10 times faster • Q3: How to optimize slow-path?
  19. 19. Bouma2 vs. Aho-Corasick (Cache) • [Chart: Bouma2 cache misses vs. Aho-Corasick cache misses] • Bouma2 exhibits 8.5 times fewer cache misses than Aho-Corasick (fast-path + slow-path)
  20. 20. Bouma2 vs. Aho-Corasick (Memory) • [Chart: Bouma2 fast-path, Bouma2 slow-path and original strings vs. Aho-Corasick] • Bouma2's footprint is less than 70% of Aho-Corasick's for textual search (down to 35% in other cases)
  21. 21. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  22. 22. Q1: How to select motifs? • Candidate 2-symbol substrings: bo, co, do, id, or, re, ri, rr • Even offset: bo|re → bo, re; co|re → co, re; co|rr|id|or → co, rr, id, or • Odd offset: b|or|e → or; c|or|e → or; c|or|ri|do|r → or, ri, do • A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
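For a string set this small, the minimum cover described in A1 can be found by plain enumeration. The sketch below is an illustration under assumptions: `parity_candidates` and `minimum_cover` are hypothetical names, the search is exhaustive (exponential in the number of candidates, so only suitable for toy inputs), and a string shorter than 3 symbols has no odd-offset candidate and makes the cover impossible.

```python
from itertools import combinations


def parity_candidates(word):
    """Return (even, odd): the 2-symbol substrings at even and at odd offsets."""
    even = {word[i:i + 2] for i in range(0, len(word) - 1, 2)}
    odd = {word[i:i + 2] for i in range(1, len(word) - 1, 2)}
    return even, odd


def minimum_cover(words):
    """Smallest motif set that hits every word at an even AND at an odd offset."""
    requirements = [c for w in words for c in parity_candidates(w)]
    universe = sorted(set().union(*requirements))
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            if all(chosen & req for req in requirements):
                return chosen
    return None  # some requirement is empty, e.g. a 2-symbol word
```

For `["bore", "core", "corridor"]` the candidates are exactly the eight substrings listed on the slide, and the smallest cover found is `{"or", "re"}`.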
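  23. 23. Q1: How to select motifs? • Candidate 2-symbol substrings: bo, co, do, id, or, re, ri, rr (√ selected, Χ not selected) • Even offset: bo|re → bo Χ, re √; co|re → co Χ, re √; co|rr|id|or → co Χ, id Χ, or √, rr Χ • Odd offset: b|or|e → or √; c|or|e → or √; c|or|ri|do|r → do Χ, or √, ri Χ • But… maybe the minimum subset is not the optimal subset?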
  24. 24. Q1: How to select motifs? • Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English • Candidate 2-symbol substrings: at, ea, er, he, te, th (√ selected, Χ not selected) • Even offset: th|ea|te|r → ea Χ, te Χ, th √ • Odd offset: t|he|at|er → at Χ, er Χ, he √ • [Figure: scan of the input “The good, the bad and the ugly” in theaters nearby; the motif hits trigger repeated full-match attempts for theater, most of them labeled “No match”]
  25. 25. Q1: How to select motifs? • Occurrence probability per 2-symbol sequence: bo 0.0002; re 0.001861; co 0.001028; rr 0.000031; id 0.001756; or 0.000444; ri 0.000284; do 0.000151 • Use input-specific occurrence statistics to optimize motif-sets • REALISTIC…
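Occurrence probabilities like the ones above could be estimated from a sample of the expected input. The sketch below is one plausible way to do it, not necessarily how the talk's figures were produced: the name `pair_probabilities` is hypothetical, and counting overlapping pairs at every offset is an assumption.

```python
from collections import Counter


def pair_probabilities(sample):
    """Estimate the occurrence probability of every 2-symbol sequence in sample.

    Assumption: overlapping pairs are counted at every offset of the sample.
    """
    pairs = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}
```

On the sample `"the theater"` there are 10 pairs, of which `th` and `he` each occur twice, so both get an estimated probability of 0.2; this is the kind of signal that flags them as expensive motifs.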
  26. 26. Q1: How to select motifs? • Candidate 2-symbol substrings: bo, co, do, id, or, re, ri, rr (√ selected, Χ not selected) • Even offset: bo|re → bo √, re Χ; co|re → co √, re Χ; co|rr|id|or → co √, id Χ, or √, rr Χ • Odd offset: b|or|e → or √; c|or|e → or √; c|or|ri|do|r → do Χ, or √, ri Χ • NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
  27. 27. Statistics for Motif Selection • [Figure: occurrence counts of 2-symbol sequences. Top: labeled peaks 00 00 (more than 100,000), “rn” and FF FF. Bottom: labeled peaks 00 00 (more than 40,000), FF FF and “??”] • 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
  28. 28. Motif Selection as an ILP Problem • L: a given string-set • $T_L$: all 2-symbol substrings of strings in L • c(t): cost-function for every t in $T_L$ • Minimize $\sum_{t \in T_L} c(t) \cdot x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$ • Subject to: for every $w \in L$, $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w,t) \ge 1$ and $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w,t) \ge 1$
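For a toy instance the program above can be solved exactly by enumerating all 0/1 assignments. The sketch below makes several assumptions: the two constraints per string are read as "at least one selected motif at an even offset" and "at least one at an odd offset", the cost c(t) is taken to be the occurrence probability quoted earlier in the talk, and `requirements` and `cheapest_cover` are hypothetical names. A real preprocessing step would use an ILP solver rather than enumeration.

```python
from itertools import combinations

# Occurrence probabilities quoted on the "Occurrence Probability" slide,
# used here as the cost c(t) (an assumption about the intended cost function).
COST = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def requirements(words):
    """One candidate set per word and parity; each must contain a selected motif."""
    reqs = []
    for w in words:
        for parity in (0, 1):
            reqs.append({w[i:i + 2] for i in range(parity, len(w) - 1, 2)})
    return reqs


def cheapest_cover(words, cost):
    """Exhaustively minimize the summed cost subject to the cover constraints."""
    reqs = requirements(words)
    universe = sorted(set().union(*reqs))
    best, best_cost = None, float("inf")
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            total = sum(cost[t] for t in chosen)
            if total < best_cost and all(chosen & r for r in reqs):
                best, best_cost = chosen, total
    return best, best_cost
```

For `["bore", "core", "corridor"]` this returns `{"bo", "co", "or"}` with a total cost of about 0.001672, which is cheaper than the two-motif cover `{"or", "re"}` (about 0.002305): the smallest motif set is not the cheapest one here.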
  29. 29. Q2: How to resolve collisions? • [Figure: bore, core and corridor aligned on a shared motif at offset 0, over relative offsets -6 to 6; corridor appears twice] • A2: Examine adjacent symbols at relative offsets to eliminate strings • New structure: The Mangled-Trie
  30. 30. The Mangled-Trie • ‘or’ motif at offset 0 • Step 1, resolve offset -1: ‘b’ → ‘e’ in offset 2? YES: MATCH “bore” in offset -1, NO: NO MATCH; ‘d’ → “corri” in offset -6? YES: MATCH “corridor” in offset -6, NO: NO MATCH; ‘c’ → step 2; OTHER → NO MATCH • Step 2, resolve offset 2: ‘e’ → MATCH “core” in offset -1; ‘r’ → step 3; OTHER → NO MATCH • Step 3: “idor” in offset 3? YES: MATCH “corridor” in offset -1, NO: NO MATCH • [Figure: steps 1, 2 and 3 applied to the input ...corricorridor... over relative offsets -6 to 6]
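The decision steps for the ‘or’ motif can be written out as straight-line code. This is a hand-coded sketch of that one trie for the strings bore, core and corridor, based on a reading of the diagram; the function name `resolve_or` and its helpers are hypothetical, and a real Mangled-Trie would be a generated data structure rather than hard-wired branches.

```python
def resolve_or(text, pos):
    """Resolve a hit of the motif 'or' found at text[pos:pos + 2].

    Returns (start_offset, string) for a full match, otherwise None.
    """
    def sym(rel):
        # Symbol at a relative offset, or "" when outside the text.
        i = pos + rel
        return text[i] if 0 <= i < len(text) else ""

    def at(rel, expected):
        # True if `expected` occurs at the given relative offset.
        start = pos + rel
        return start >= 0 and text.startswith(expected, start)

    prev = sym(-1)                      # step 1: resolve on offset -1
    if prev == "b":
        return (pos - 1, "bore") if sym(2) == "e" else None
    if prev == "d":
        return (pos - 6, "corridor") if at(-6, "corri") else None
    if prev == "c":
        nxt = sym(2)                    # step 2: resolve on offset 2
        if nxt == "e":
            return (pos - 1, "core")
        if nxt == "r":                  # step 3: check the remaining suffix
            return (pos - 1, "corridor") if at(3, "idor") else None
    return None                         # OTHER: no match
```

On the input `"corricorridor"`, the hit at offset 1 is rejected in step 3, while the hits at offsets 6 and 11 both resolve to `corridor` starting at offset 5, each through a different branch of step 1.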
  31. 31. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Preprocessing in Detail • Future Work
  32. 32. Q3: How to optimize slow-path? • A3: Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction; Improve Motif-Set Quality: Avoid slow-path altogether when possible
  33. 33. More Future Work… • Adaptive System: Collect statistics “on-the-go” and improve motif-set • Faster Preprocessing: Custom Branch-and-Cut (Margot ’10) • Regular Expressions • Hardware Implementation • Bouma3?…
  34. 34. “Search has always been about people. It's not an abstract thing. It's not a formula. It's about getting people what they need... It depends on the type of search you do, and how to take all those signals and put them together.” - Udi Manber, Google, 2008
  35. 35. Thank You
