Bouma2: Erez Buchnik, February 2012
  • Bouma2

    1. Bouma2. Erez Buchnik, February 2012
    2. “If you can raed tihs, tehn you are prbbolay not a sttae-mhciane.”
    3. Agenda: Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
    5. The Multiple Exact String-Match Problem
        “Given a string-set L ⊆ Σ* and an input stream W_I ∈ Σ*, find all occurrences of any of the strings in L that appear in W_I.”
        Uses: anti-virus (AV), intrusion prevention (IPS), deep packet inspection (DPI), DNA search, etc.
    6. References: Aho-Corasick ’75; Commentz-Walter ’79; Rabin-Karp ’87; Wu-Manber ’94; Muth-Manber ’96; Hopcroft-Motwani-Ullman ’00; Dori-Landau ’06
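As a baseline for the problem statement on slide 5, here is a minimal sketch of the naive approach: search for each string of L separately and collect every occurrence. The function name `find_all` is hypothetical and not part of the deck.

```python
def find_all(strings, text):
    """Return sorted (position, string) pairs for every occurrence in text."""
    hits = []
    for s in strings:
        start = text.find(s)
        while start != -1:
            hits.append((start, s))
            start = text.find(s, start + 1)  # step by 1 so overlaps are kept
    return sorted(hits)


print(find_all(["bits", "cooks", "boat"], "rabbits hate cooks"))
```

The cost grows with the number of strings, which is what the algorithms on the following slides try to avoid.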
    8. Aho-Corasick
        [Figure: trie-based state machine with states 0 to 15, built for the strings ffe, fov, lad, dan, ada; state 0 loops on [^flda]. Input: "ffefovladdanada".]
    9. Wu-Manber
        Block   SKIP   Strings
        fe      0      ffe
        ad      0      lad
        an      0      dan
        da      0      ada
        ov      0      fov
        ff      1
        fo      1
        la      1
        ..      2
        Input: "ffefovladdanada"
    10. Rabin-Karp
        [Figure: hash table with buckets 0 to 12 holding the strings lad, ffe, fov, dan, ada. Input: "ffefovladdanada".]
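For reference, a compact sketch of the textbook Aho-Corasick construction (goto, failure and output tables), run on the five strings of slide 8. This is not the Snort implementation used in the benchmark; the function names and the table layout are illustrative assumptions.

```python
from collections import deque


def build_automaton(strings):
    goto = [{}]   # goto[state][symbol] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # strings recognised when a state is reached
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(strings, text):
    goto, fail, out = build_automaton(strings)
    state, hits = 0, []
    for i, ch in enumerate(text):       # strictly left to right, one symbol at a time
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            hits.append((i - len(s) + 1, s))
    return sorted(hits)


print(search(["ffe", "fov", "lad", "dan", "ada"], "ffefovladdanada"))
```

Note that the scan carries a state from symbol to symbol, which is the consume-order dependence the deck contrasts Bouma2 against.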
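The SKIP values on slide 9 can be reproduced with a small sketch of the Wu-Manber shift table, assuming blocks of 2 symbols and a window equal to the shortest string. The names `build_tables` and `search` are hypothetical, and the real algorithm hashes blocks into fixed-size tables rather than using dictionaries.

```python
def build_tables(strings, block=2):
    m = min(len(s) for s in strings)      # window length
    default = m - block + 1               # skip for blocks that occur in no string
    shift, bucket = {}, {}
    for s in strings:
        for j in range(m - block + 1):
            b = s[j:j + block]
            shift[b] = min(shift.get(b, default), m - block - j)
        # strings are grouped by the last block of their first m symbols
        bucket.setdefault(s[m - block:m], []).append(s)
    return m, default, shift, bucket


def search(strings, text, block=2):
    m, default, shift, bucket = build_tables(strings, block)
    hits, pos = [], m - 1                 # pos is the last index of the window
    while pos < len(text):
        b = text[pos - block + 1:pos + 1]
        skip = shift.get(b, default)
        if skip == 0:                     # a string may end its window here: verify
            start = pos - m + 1
            hits.extend((start, s) for s in bucket.get(b, ()) if text.startswith(s, start))
            skip = 1
        pos += skip
    return hits


m, default, shift, bucket = build_tables(["ffe", "fov", "lad", "dan", "ada"])
print(shift, default)
print(search(["ffe", "fov", "lad", "dan", "ada"], "ffefovladdanada"))
```

For the five example strings the computed table agrees with the slide: fe, ad, an, da and ov skip 0, ff, fo and la skip 1, and every other block skips 2.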
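A minimal sketch of multi-string Rabin-Karp for strings of equal length, as in the slide 10 example: a rolling hash of each input window is looked up in a table of string hashes, and candidates are verified by direct comparison. The base and modulus are arbitrary illustrative choices, not values from the deck.

```python
def rabin_karp(strings, text, base=256, mod=1_000_003):
    m = len(strings[0])
    if any(len(s) != m for s in strings):
        raise ValueError("this sketch assumes strings of equal length")

    def full_hash(s):
        value = 0
        for ch in s:
            value = (value * base + ord(ch)) % mod
        return value

    table = {}
    for s in strings:
        table.setdefault(full_hash(s), []).append(s)

    hits = []
    if len(text) < m:
        return hits
    top = pow(base, m - 1, mod)           # weight of the symbol leaving the window
    value = full_hash(text[:m])
    for i in range(len(text) - m + 1):
        for s in table.get(value, ()):
            if text[i:i + m] == s:        # guard against hash collisions
                hits.append((i, s))
        if i + m < len(text):             # roll the window one symbol to the right
            value = ((value - ord(text[i]) * top) * base + ord(text[i + m])) % mod
    return hits


print(rabin_karp(["lad", "ffe", "fov", "dan", "ada"], "ffefovladdanada"))
```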
    12. Bouma2: Motif-Based String Match
        Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
        Set of selected 2-symbol-long substrings: re, ek, bi, at, ok, or
        Preprocessing: map every string to one of its own substrings, its motif.
        Q1: How to select motifs?
    13. Bouma2: Motif-Based String Match
        Input: "rabbits hate cooks"
        [Figure: motif occurrences in the input trigger full-match attempts; boat and book end in no match, bits and cooks are matched.]
        Match: examine symbols 2-by-2 (STATELESS, consume-order agnostic); attempt a full match around motif occurrences.
        Q2: How to resolve collisions?
    14. Capturing all Occurrences
        Input: "habits of rabbits"
        [Figure: bits is matched in "habits" and again in "rabbits".]
        Even-offset occurrences and odd-offset occurrences require separate passes, but instead...
    15. Upgrade #1: 2-Symbol Strides
        Input: "habits of rabbits"
        [Figure: both occurrences of bits are matched.]
        • We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
    16. Upgrade #2: Fast-Path / Slow-Path
        Input: "habits of rabbits"
        [Figure: the fast-path reports input positions 4 and 14.]
        Fast-Path:
        - Stateless (agnostic to consume-order)
        - “Monolithic” (zero branches)
        - Cache-aware (small direct table)
        - SIMPLE...
    17. Upgrade #2: Fast-Path / Slow-Path
        Input: "habits of rabbits"
        [Figure: positions 4 and 14 reported by the fast-path are resolved into matches of bits.]
        Slow-Path:
        - Memory-efficient (pointers to the original strings for comparison)
        - “Localized” (separate structure for every motif)
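The basic scheme of slides 12 and 13 can be sketched as follows: each string is mapped to one 2-symbol motif, the input is examined two symbols at a time, and a full comparison is attempted around every motif occurrence. The string-to-motif pairing below is a reading of the slide and should be treated as an assumption, as are the function names.

```python
# Assumed string-to-motif pairing, read off slide 12.
MOTIF_OF = {
    "bore": "re", "core": "re", "trek": "ek", "bits": "bi",
    "corridor": "or", "boat": "at", "book": "ok", "cooks": "ok",
}


def build_table(motif_of):
    """Map each motif to (string, offset of the motif inside the string) pairs."""
    table = {}
    for s, motif in motif_of.items():
        for off in range(len(s) - 1):
            if s[off:off + 2] == motif:
                table.setdefault(motif, []).append((s, off))
    return table


def match(motif_of, text):
    table = build_table(motif_of)
    matched, rejected = set(), set()
    # No state is carried between windows, so they could be visited in any order.
    for i in range(len(text) - 1):
        for s, off in table.get(text[i:i + 2], ()):
            start = i - off
            if start >= 0 and text.startswith(s, start):
                matched.add((start, s))
            else:
                rejected.add((start, s))
    return sorted(matched), sorted(rejected)


print(match(MOTIF_OF, "rabbits hate cooks"))
```

On the slide's input the sketch matches bits and cooks and rejects the attempts for boat and book. Two strings sharing the motif "ok" is the kind of collision that Q2 asks about.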
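A sketch of the 2-symbol stride idea from slide 15, under a simplifying assumption: instead of selecting motifs, each string is mapped to the pair at its offset 0 (even) and the pair at its offset 1 (odd). Only even input positions are then examined. An occurrence starting at an even input position is caught through its even-offset motif, and one starting at an odd position through its odd-offset motif.

```python
def build_table(strings):
    table = {}
    for s in strings:
        if len(s) < 3:
            raise ValueError("this sketch needs strings of 3 symbols or more")
        table.setdefault(s[0:2], []).append((s, 0))   # even-offset motif
        table.setdefault(s[1:3], []).append((s, 1))   # odd-offset motif
    return table


def match_stride2(strings, text):
    table = build_table(strings)
    hits = []
    for i in range(0, len(text) - 1, 2):              # even positions only
        for s, off in table.get(text[i:i + 2], ()):
            start = i - off
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return sorted(hits)


print(match_stride2(["bits"], "habits of rabbits"))
```

Here "bits" in "habits" starts at an even position and is found through "bi", while "bits" in "rabbits" starts at an odd position and is found through "it".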
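The split between the two paths might look like the sketch below. The fast-path is a single lookup per 2-byte window in a direct table of 65,536 flags; the slow-path keeps, per motif, references to the original strings and verifies them. The fixed motif offsets 0 and 1 are a simplification, all names are hypothetical, and Python cannot express the branch-free inner loop the slide describes.

```python
def build(strings):
    direct = bytearray(1 << 16)   # fast-path: 1 if the 2-byte value is a motif
    slow = {}                     # slow-path: motif value -> [(string, offset)]
    for s in strings:
        for off in (0, 1):        # assumed motifs: pairs at offsets 0 and 1
            key = (s[off] << 8) | s[off + 1]
            direct[key] = 1
            slow.setdefault(key, []).append((s, off))
    return direct, slow


def fast_path(direct, data):
    """Return the even input positions whose 2-byte window is a motif."""
    return [i for i in range(0, len(data) - 1, 2)
            if direct[(data[i] << 8) | data[i + 1]]]


def slow_path(slow, data, positions):
    """Verify the candidate strings around each reported position."""
    hits = []
    for i in positions:
        key = (data[i] << 8) | data[i + 1]
        for s, off in slow[key]:
            start = i - off
            if start >= 0 and data.startswith(s, start):
                hits.append((start, s))
    return hits


direct, slow = build([b"bits"])
data = b" habits of rabbits"      # leading space assumed, as in the slide's quoted input
positions = fast_path(direct, data)
print(positions, slow_path(slow, data, positions))
```

With the leading space, the fast-path reports positions 4 and 14, the two numbers shown on the slide; whether the slide counts positions the same way is an assumption.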
    19. Bouma2 vs. Aho-Corasick
        • n: length of input
        • S: no. of string-matches in n
        • P: probability of motif-match
        • l: length of the longest string
        Match complexities:
        - Aho-Corasick: O(n + S)
        - Bouma2: O(n · (0.5 + P · (l − 2)))
    20. Benchmark
        - Performed against the Snort implementation of Aho-Corasick
        - Tested with 1GB of genuine IP traffic recorded at an ISP site
        - Database included 4,841 unique strings extracted from Snort rules, 3 bytes long or longer
        - Aggregate size of database strings: 98,546 bytes
        - Tested using Snort source code merged with Bouma2, on an Intel Core2 Duo 2.53GHz with 1.95GB RAM running XP SP3
        - Profiled with the Visual Studio 2010 Sampling Profiler
        - For Bouma2, three different motif-selection methods were compared:
          B2-M (Minimum): the minimum number of motifs
          B2-RS (Rare in Strings): prefer motifs that occur fewer times within the database strings
          B2-RI (Rare in Input): prefer motifs that are expected to occur fewer times in the input (based on statistics over one third of the input)
    21. Benchmark: Bouma2 vs. Snort AC (Throughput)
        [Chart: throughput (Mbit/sec, 0 to 3,500) against total string size (bytes, 0 to 100,000) for AC, B2-M, B2-RS and B2-RI.]
    22. Benchmark: Bouma2 vs. Snort AC (Memory)
        - Snort creates several AC instances, which are pre-filtered by port
        - The comparison was done against a single Bouma2 instance
        [Chart: memory consumption (bytes, 0 to 50,000,000) against total string size (bytes, 0 to 100,000) for AC, B2-M, B2-RS and B2-RI.]
    24. Q1: How to select motifs?
        Offset  Split          bo  co  do  id  or  re  ri  rr
        Even    bo|re          •                   •
        Even    co|re              •               •
        Even    co|rr|id|or        •       •   •           •
        Odd     b|or|e                         •
        Odd     c|or|e                         •
        Odd     c|or|ri|do|r           •       •       •
        • A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
    25. Q1: How to select motifs?
        Offset  Split          bo  co  do  id  or  re  ri  rr
        Even    bo|re          ✗                   √
        Even    co|re              ✗               √
        Even    co|rr|id|or        ✗       ✗   √           ✗
        Odd     b|or|e                         √
        Odd     c|or|e                         √
        Odd     c|or|ri|do|r           ✗       √       ✗
        • But... maybe the minimum subset is not the optimal subset?
    26. Q1: How to select motifs?
        Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English
        Offset  Split          at  ea  er  he  te  th
        Even    th|ea|te|r         ✗           ✗   √
        Odd     t|he|at|er     ✗       ✗   √
        Input: “The good, the bad and the ugly” in theaters nearby
        [Figure: full-match attempts for "theater" triggered along the input; most of them end in no match.]
    27. Q1: How to select motifs?
        2-Symbol Sequence   Occurrence Probability
        bo                  0.0002
        re                  0.001861
        co                  0.001028
        rr                  0.000031
        id                  0.001756
        or                  0.000444
        ri                  0.000284
        do                  0.000151
        • Use input-specific occurrence statistics to optimize motif-sets
        • REALISTIC...
    28. Q1: How to select motifs?
        Offset  Split          bo  co  do  id  or  re  ri  rr
        Even    bo|re          √                   ✗
        Even    co|re              √               ✗
        Even    co|rr|id|or        √       ✗   √           ✗
        Odd     b|or|e                         √
        Odd     c|or|e                         √
        Odd     c|or|ri|do|r           ✗       √       ✗
        • NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
    29. Statistics for Motif Selection
        [Charts: occurrence counts of 2-symbol sequences, x-axis 0 to 70,000 (sequences "00 00" through "FF FF"). Top: y-axis 0 to 10,000,000, with labels "00 00" (more than 100,000) and “rn”. Bottom: y-axis 0 to 35,000,000, with labels "00 00" (more than 40,000) and “??”.]
        • 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
    30. Motif Selection as an ILP Problem
        • L: a given string-set
        • T_L: all 2-symbol substrings of strings in L
        • c(t): cost-function for every t in T_L
        Minimize $\sum_{t \in T_L} c(t)\, x_t$, where $x_t \in \{0, 1\}$ for every $t \in T_L$
        Subject to, for every $w \in L$: $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w, t) \ge 1$ and $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w, t) \ge 1$
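The cost of the bad selection on slide 26 can be illustrated by counting how many full-match attempts a stride-2 scan triggers for "theater" under two motif choices. The lowercase, unquoted input and the alternative motifs "ea" and "at" are assumptions made for this sketch, as is the function name.

```python
def attempts(string, even_motif, odd_motif, text):
    """Return (full-match attempts, real matches) for one string at stride 2."""
    offsets = {even_motif: string.find(even_motif),
               odd_motif: string.find(odd_motif)}
    tried = found = 0
    for i in range(0, len(text) - 1, 2):
        off = offsets.get(text[i:i + 2])
        if off is None:
            continue                      # window is not a motif: nothing to do
        tried += 1
        start = i - off
        if start >= 0 and text.startswith(string, start):
            found += 1
    return tried, found


TEXT = "the good, the bad and the ugly in theaters nearby"
print(attempts("theater", "th", "he", TEXT))   # motifs taken from 'the'
print(attempts("theater", "ea", "at", TEXT))   # assumed alternative motifs
```

Both choices find the single real match, but the motifs taken from "the" trigger four attempts on this input against two for the alternative pair.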
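To make the formulation on slide 30 concrete, here is a brute-force sketch that enumerates motif subsets for the three-string example of the earlier slides. It assumes that assoc_0(w, t) and assoc_1(w, t) indicate whether t occurs in w at an even or an odd offset; the deck does not define them explicitly. Exhaustive enumeration is only workable for toy inputs, and a real string-set would need an ILP solver or a heuristic.

```python
from itertools import combinations


def pairs_at(s, parity):
    """2-symbol substrings of s that start at offsets of the given parity."""
    return {s[i:i + 2] for i in range(parity, len(s) - 1, 2)}


def select_motifs(strings, cost):
    candidates = sorted(set().union(*(pairs_at(s, 0) | pairs_at(s, 1) for s in strings)))
    best, best_cost = None, float("inf")
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            chosen = set(subset)
            # Both constraints: an even-offset and an odd-offset motif per string.
            feasible = all(pairs_at(s, 0) & chosen and pairs_at(s, 1) & chosen
                           for s in strings)
            total = sum(cost(t) for t in subset)
            if feasible and total < best_cost:
                best, best_cost = chosen, total
    return best


STRINGS = ["bore", "core", "corridor"]
PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}

print(select_motifs(STRINGS, lambda t: 1))   # unit cost: fewest motifs
print(select_motifs(STRINGS, PROB.get))      # cost = occurrence probability
```

With unit costs the result is the minimum cover {or, re} of slide 25; with the occurrence probabilities of slide 27 as costs it becomes {bo, co, or}, the selection shown on slide 28.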
    31. Q2: How to resolve collisions?
        [Figure: the strings that contain the motif "or", aligned on an axis of relative offsets -6 to 6 with the motif at offset 0: b-or-e, c-or-e, c-or-ridor, corrid-or.]
        • A2:
        - New structure: the Mangled-Trie
        - Examine adjacent symbols at relative offsets to eliminate strings
        - The Mangled-Trie itself dictates where to look next (instead of following a strict left-to-right sequence)
    32. The Mangled-Trie
        ‘or’ motif at offset 0:
        1. Resolve offset -1:
           - ‘b’: “e” in offset 2? YES: “bore” in offset -1. NO: no match.
           - ‘d’: “corri” in offset -6? YES: “corridor” in offset -6. NO: no match.
           - ‘c’: go to 2.
           - OTHER: no match.
        2. Resolve offset 2:
           - ‘e’: “core” in offset -1.
           - ‘r’: go to 3.
           - OTHER: no match.
        3. “idor” in offset 3? YES: “corridor” in offset -1. NO: no match.
        [Figure: example input “...corricorridor...” on the axis of relative offsets -6 to 6.]
    33. Q3: How to optimize slow-path?
        • A3:
        - Optimize frequent scenarios: apply statistics to Mangled-Trie construction
        - Improve motif-set quality: avoid the slow-path altogether when possible
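A simplified decision tree in the spirit of the Mangled-Trie can be sketched as follows. At each node it picks the relative offset that splits the remaining candidates into the most groups, so the tree itself dictates where to look next. The greedy split rule, the full re-verification at the leaves and all names are assumptions of this sketch, not the construction used by Bouma2.

```python
def candidates_for(motif, strings):
    """(string, offset) pairs for every place the motif occurs in a string."""
    return [(s, off) for s in strings
            for off in range(len(s) - 1) if s[off:off + 2] == motif]


def symbol_at(cand, rel):
    """Symbol a candidate expects at an offset relative to the motif, or None."""
    s, off = cand
    k = off + rel
    return s[k] if 0 <= k < len(s) else None


def build(cands):
    lo = -max(off for _, off in cands)
    hi = max(len(s) - off for s, off in cands)
    best_rel, best_groups = None, 1
    for rel in range(lo, hi):
        symbols = [symbol_at(c, rel) for c in cands]
        # Only offsets that every candidate constrains are considered.
        if None not in symbols and len(set(symbols)) > best_groups:
            best_rel, best_groups = rel, len(set(symbols))
    if best_rel is None:
        return ("verify", cands)          # nothing left to split on
    branches = {}
    for c in cands:
        branches.setdefault(symbol_at(c, best_rel), []).append(c)
    return ("resolve", best_rel, {sym: build(g) for sym, g in branches.items()})


def resolve(node, text, i):
    """Resolve a motif occurrence at input position i into matched strings."""
    while node[0] == "resolve":
        _, rel, branches = node
        k = i + rel
        sym = text[k] if 0 <= k < len(text) else None
        if sym not in branches:
            return []                     # the OTHER branch: no match
        node = branches[sym]
    return [(i - off, s) for s, off in node[1]
            if i - off >= 0 and text.startswith(s, i - off)]


tree = build(candidates_for("or", ["bore", "core", "corridor"]))
print(tree[1], tree[2]["c"][1])           # offsets examined first and second
print(resolve(tree, "the corridor", 5), resolve(tree, "the corridor", 10))
```

For the motif "or" this sketch first examines relative offset -1 and then, for ‘c’, relative offset 2, which is the order of the first two Resolve steps on the slide.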
    35. Bouma2: Hash-Functions Revisited. Erez Buchnik, March 2012
    36. Hash Functions
        What is a Hash-Function?
        “A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length. ... The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.”
        What input should we expect?
        What is a GOOD (non-cryptographic) Hash-Function?
        “A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.”
    37. Bouma2 defines a hash-function:
        - A tailored, optimized mapping of strings to their own substrings.
        - Collision-resolving is also optimized, based on relative offset information
    38. The Multiple Exact String-Match Problem
        “Given a string-set L ⊆ Σ* and an input stream W_I ∈ Σ*, find all occurrences of any of the strings in L that appear in W_I.”
        FACT: The definition of the problem DOES NOT imply that we must scan the input from left to right, or in any other order.
    39. The Multiple Exact String-Match Problem
        “Given a string-set L ⊆ Σ* and an input stream W_I ∈ Σ*, find all occurrences of any of the strings in L that appear in W_I.”
        CLAIM: Algorithms that impose a consume-order constraint are in general less efficient than algorithms that are free of this constraint.
    40. The Multiple Exact String-Match Problem
        “Given a string-set L ⊆ Σ* and an input stream W_I ∈ Σ*, find all occurrences of any of the strings in L that appear in W_I.”
        Which dominant factor should we choose when designing an efficient string-match algorithm?...
        [Figure: the values 5000, 1500 and 15 shown alongside the labels Naïve Approach, Aho-Corasick and Bouma2.]
