Regular Expressions
Performance
Optimizing event capture building
better Ossim Agent plugins
About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● AlienVault's Authorized Training Center (ATC)
for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec
About Me
● David Gil <dgil@a3sec.com>
● Developer, Sysadmin, Project Manager
● Really believes in the Open Source model
● Programming since he was 9 years old
● Ossim developer in its early stages
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi books reader and mountain bike rider
Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
What is a regex?
Regular expression:

(bb|[^b]{2})\d\d
Input strings:

bb445, 2ac3357bb, bb3aa2c7,
a2ab64b, abb83fh6l3hi22ui
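A minimal sketch of how the pattern behaves on the five input strings, assuming the pattern ends in two digits (`\d\d`). The names `PATTERN`, `INPUTS` and `matches` are illustrative only.

```python
import re

# "bb" or any two non-"b" characters, followed by two digits
PATTERN = re.compile(r'(bb|[^b]{2})\d\d')

INPUTS = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']

# Map each input to the first matched substring, or None when nothing matches
matches = {}
for text in INPUTS:
    found = PATTERN.search(text)
    matches[text] = found.group(0) if found else None

for text, hit in matches.items():
    print('%-18s -> %s' % (text, hit))
# bb445              -> bb44
# 2ac3357bb          -> ac33
# bb3aa2c7           -> None
# a2ab64b            -> None
# abb83fh6l3hi22ui   -> bb83
```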
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve

● Python string functions are small C loops:
super fast!
○ startswith(), endswith(), split(), etc.

● Use standard parsing libraries!
Formats: JSON, HTML, XML, CSV, etc.
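As a rough illustration of the two alternatives above, the sketch below uses only string methods and standard-library parsers, with no regex. The log line and the JSON/CSV samples are made up for the example.

```python
import csv
import io
import json

line = 'sshd: Failed password for root'   # hypothetical log line

# String methods cover simple prefix/suffix/split checks
is_sshd = line.startswith('sshd:')                # True
ends_with_root = line.endswith('root')            # True
process, message = line.split(': ', 1)            # 'sshd', 'Failed password for root'

# Standard parsers handle structured formats, including quoting rules
event = json.loads('{"user": "root", "port": 22}')
row = next(csv.reader(io.StringIO('a,"b,c",d')))  # ['a', 'b,c', 'd']
```

The CSV case is the kind of input where a hand-written regex tends to go wrong: the quoted field contains the delimiter.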
To RE or not to RE
Example: URL parsing
● regex:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$

● PHP's parse_url() function:
$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
Array
(
    [scheme] => http
    [host] => hostname
    [user] => username
    [pass] => password
    [path] => /path
    [query] => arg=value
    [fragment] => anchor
)
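The same idea in Python, as a sketch: the standard library's `urllib.parse.urlsplit` plays the role of `parse_url()` here. The variable names are illustrative.

```python
from urllib.parse import urlsplit

url = "http://username:password@hostname/path?arg=value#anchor"
parts = urlsplit(url)

# Each component is available as an attribute; no regex involved
fields = {
    'scheme': parts.scheme,
    'host': parts.hostname,
    'user': parts.username,
    'pass': parts.password,
    'path': parts.path,
    'query': parts.query,
    'fragment': parts.fragment,
}
print(fields)
```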
To RE or not to RE
But, there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?
Basics - Characters
● \d, \D: digits / non-digits. \w, \W: word / non-word characters. \s, \S: whitespace / non-whitespace
>>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
>>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')

● ^, $: Begin/End of string
>>> re.findall(r'(\d+)', 'cba3456csw')
>>> re.findall(r'^(\d+)$', 'cba3456csw')

● . (dot): Any character (except a newline by default):
>>> re.findall(r'foo(.)bar', 'foo=bar')
>>> re.findall(r'(...)=(...)', 'foo=bar')
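For reference, a sketch that evaluates the examples on this slide, with the character classes written with their backslashes. The variable names are illustrative; the comments show what each call returns.

```python
import re

# \d digit, \S non-whitespace, \s whitespace
month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')   # ['07']
pair = re.findall(r'(\S+)\s+(\S+)', 'foo bar')              # [('foo', 'bar')]

# Unanchored: digits found anywhere. Anchored: the whole string must be digits.
unanchored = re.findall(r'(\d+)', 'cba3456csw')             # ['3456']
anchored = re.findall(r'^(\d+)$', 'cba3456csw')             # []

# The dot matches any single character
separator = re.findall(r'foo(.)bar', 'foo=bar')             # ['=']
halves = re.findall(r'(...)=(...)', 'foo=bar')              # [('foo', 'bar')]
```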
Basics - Repetitions
● *, +: 0 or more, 1 or more repetitions
>>> re.findall(r'FO+', 'FOOOOOOOOO')
>>> re.findall(r'BA*R', 'BR')

● ?: 0 or 1 repetitions
>>> re.findall(r'colou?r', 'color')
>>> re.findall(r'colou?r', 'colour')

● {n}, {n,m}: exactly n, between n and m repetitions:
>>> re.findall(r'\d{2}', '2013-07-21')
>>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
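A sketch of what the repetition examples return, assuming the dots in the IP pattern are meant as literal dots (`\.`). The last line is an extra, made-up input showing why the escaping matters.

```python
import re

text = 'FOOOOOOOOO'
plus = re.findall(r'FO+', text)                  # one match: the whole string
star = re.findall(r'BA*R', 'BR')                 # ['BR'], zero "A"s is fine

color = re.findall(r'colou?r', 'color')          # ['color']
colour = re.findall(r'colou?r', 'colour')        # ['colour']

pairs = re.findall(r'\d{2}', '2013-07-21')       # ['20', '13', '07', '21']
ip = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')

# With an unescaped dot, any separator is accepted
loose = re.findall(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}', '192x168y1z25')
```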
Basics - Groups
[...]: Set of characters
>>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')

...|...: Alternation
>>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

(...) and \1, \2, ...: Group and backreferences
>>> re.findall(r'(\w+)=(\1)', 'foo=bar')
>>> re.findall(r'(\w+)=(\1)', 'foo=foo')

(?P<name>...): Named group
>>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
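A sketch of the group examples, with the backreference written as `\1` and the named group closed inside the pattern. The variable names are illustrative.

```python
import re

charset = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')             # ['foo=bar']
alternation = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')   # [('foo', 'bar')]

# \1 must repeat exactly what group 1 captured
backref_miss = re.findall(r'(\w+)=(\1)', 'foo=bar')           # []
backref_hit = re.findall(r'(\w+)=(\1)', 'foo=foo')            # [('foo', 'foo')]

# A named group can be read back by name from the match object
found = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
day = found.group('day')                                      # '23'
```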
Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']

● An overall match takes precedence over an
overall non-match
>>> re.findall('<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')

● Minimal matching, non-greedy
>>> re.findall('<(.*)>.*', '<B>i am bold</B>')
>>> re.findall('<(.*?)>.*', '<B>i am bold</B>')
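A sketch of what these four calls return. A greedy `.*` first grabs as much as it can, then gives characters back until the rest of the pattern can match; the lazy `.*?` starts from the shortest candidate instead.

```python
import re

html = '<B>i am bold</B>'

# Greedy quantifiers backtrack just enough for the overall match to succeed
whole = re.findall(r'<.*>.*</.*>', html)        # ['<B>i am bold</B>']
tags = re.findall(r'<(.*)>.*</(.*)>', html)     # [('B', 'B')]

# Greedy captures up to the last ">", lazy stops at the first one
greedy = re.findall(r'<(.*)>.*', html)          # ['B>i am bold</B']
lazy = re.findall(r'<(.*?)>.*', html)           # ['B']
```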
Performance Tests
Different implementations of a custom
is_a_word() function:
● #1 Regexp
● #2 Char iteration
● #3 String functions
Performance Test #1
import re
import string

def is_a_word(word):
    # Character class built from the ASCII letters, anchored at both ends
    CHARS = string.ascii_uppercase + string.ascii_lowercase
    regexp = r'^[%s]+$' % CHARS
    return "YES" if re.search(regexp, word) else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))
time (s)        result  length   input
1.49650502205   YES     len=4    word
1.65614509583   YES     len=25   wordlongerthanpreviousone..
1.92520785332   YES     len=60   wordlongerthanpreviosoneplusan..
2.38850092888   YES     len=120  wordlongerthanpreviosoneplusan..
1.55924701691   NOP     len=10   not a word
1.7087020874    NOP     len=25   not a word, just a phrase..
1.92521882057   NOP     len=50   not a word, just a phrase bigg..
2.39075493813   NOP     len=102  not a word, just a phrase bigg..

The longer the target string, the slower the regex
matching, whether it succeeds or fails.
Performance Test #2
import string

CHARS = string.ascii_uppercase + string.ascii_lowercase

def is_a_word(word):
    # Python-level loop; "in" scans CHARS again for every character
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

timeit.timeit(s, 'is_a_word(%s)' %(w))
time (s)        result  length   input
0.687522172928  YES     len=4    word
1.0725839138    YES     len=25   wordlongerthanpreviousone..
2.34717106819   YES     len=60   wordlongerthanpreviosoneplusan..
4.31543898582   YES     len=120  wordlongerthanpreviosoneplusan..
0.54797577858   NOP     len=10   not a word
0.547253847122  NOP     len=25   not a word, just a phrase..
0.546499967575  NOP     len=50   not a word, just a phrase bigg..
0.553755998611  NOP     len=102  not a word, just a phrase bigg..

On success: two nested Python loops (slow).
On failure: every input stops at the same point, in the same time (the first space).
Performance Test #3
def is_a_word(word):
    return "YES" if word.isalpha() else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))
time (s)        result  length   input
0.146447896957  YES     len=4    word
0.212563037872  YES     len=25   wordlongerthanpreviousone..
0.318686008453  YES     len=60   wordlongerthanpreviosoneplusan..
0.493942975998  YES     len=120  wordlongerthanpreviosoneplusan..
0.14647102356   NOP     len=10   not a word
0.146160840988  NOP     len=25   not a word, just a phrase..
0.147103071213  NOP     len=50   not a word, just a phrase bigg..
0.146239995956  NOP     len=102  not a word, just a phrase bigg..

Python string functions (fast and small C loops)
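To reproduce a comparison like the three tests above, something along these lines could be used. It is a sketch, not the harness used for the slides: the function names, the `SAMPLES` list and the `number` of repetitions are assumptions, and the regex is precompiled here, which the slide version does not do. Absolute timings will differ per machine and Python version.

```python
import re
import string
import timeit

CHARS = string.ascii_uppercase + string.ascii_lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)   # assumption: compiled once up front

def is_a_word_regex(word):
    return "YES" if WORD_RE.search(word) else "NOP"

def is_a_word_loop(word):
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

def is_a_word_str(word):
    return "YES" if word.isalpha() else "NOP"

IMPLEMENTATIONS = (is_a_word_regex, is_a_word_loop, is_a_word_str)
SAMPLES = ("word", "wordlongerthanpreviousone",
           "not a word", "not a word, just a phrase")

def benchmark(number=20000):
    """Time every implementation on every sample; returns seconds per run."""
    results = {}
    for func in IMPLEMENTATIONS:
        for sample in SAMPLES:
            results[(func.__name__, sample)] = timeit.timeit(
                lambda: func(sample), number=number)
    return results

if __name__ == "__main__":
    for (name, sample), seconds in benchmark().items():
        print("%-16s %-28r %.4f s" % (name, sample, seconds))
```

Note that `str.isalpha()` also accepts non-ASCII letters, so the three versions only agree on ASCII input such as the samples used here.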
Performance Strategies
Writing regex
● Be careful with repetitions (+, *, {n,m})
(abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
(abc|def){2,1000} produces ...

● Be careful with wildcards
re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef') # slower
re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef') # faster

● Longer target string -> slower regex
matching
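A small sketch of the wildcard point: both patterns return the same result on the slide's input, but on a long line that never matches, the `.*` version has to backtrack through the whole line before giving up. The `near_miss` input and the repetition count are made up for the example, and the timings are only indicative.

```python
import re
import timeit

WILDCARD = re.compile(r'(ab).*(cd).*(ef)')
EXPLICIT = re.compile(r'(ab)\s(cd)\s(ef)')

line = 'ab cd ef'
near_miss = 'ab cd ' + 'x' * 200     # hypothetical line with no trailing "ef"

same_hit = WILDCARD.findall(line) == EXPLICIT.findall(line)
wildcard_miss = WILDCARD.findall(near_miss)   # [] after a lot of backtracking
explicit_miss = EXPLICIT.findall(near_miss)   # [] almost immediately

if __name__ == "__main__":
    for name, pattern in (('wildcard', WILDCARD), ('explicit', EXPLICIT)):
        seconds = timeit.timeit(lambda: pattern.findall(near_miss), number=20000)
        print('%-9s %.4f s' % (name, seconds))
```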
Performance Strategies
Writing regex
● Use a non-capturing group when you don't need
to capture the text and save it to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)

● Put the pattern most likely to match first, and factor out common prefixes
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY)

● Use anchors (^ and $) to limit the scope
re.findall(r'(ab){2}', 'abcabcabc')
re.findall(r'^(ab){2}', 'abcabcabc') # failures occur faster
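The three tips on this slide can be sketched as follows. The log line is invented for the example; the patterns are the ones from the slide, with the factored version also made non-capturing.

```python
import re

line = 'fw01 TRAFFIC_DROP src=10.0.0.1'   # hypothetical firewall log line

# Full alternatives vs. common prefix factored out and no capture
CAPTURING = re.compile(r'(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)')
FACTORED = re.compile(r'TRAFFIC_(?:ALLOW|DROP|DENY)')

capturing_hit = CAPTURING.search(line).group(0)   # 'TRAFFIC_DROP'
factored_hit = FACTORED.search(line).group(0)     # 'TRAFFIC_DROP'
factored_groups = FACTORED.search(line).groups()  # () since nothing is captured

# Anchored pattern: only position 0 is tried, so a miss is detected sooner
UNANCHORED = re.compile(r'(ab){2}')
ANCHORED = re.compile(r'^(ab){2}')

unanchored_miss = UNANCHORED.search('abcabcabc')  # None, after trying each position
anchored_miss = ANCHORED.search('abcabcabc')      # None, after trying position 0
anchored_hit = ANCHORED.search('ababab').group(0) # 'abab'
```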
Performance Strategies
Writing Agent plugins
● A new process is forked for each loaded
plugin
○ Load only the plugins that you really need!

● A plugin is a set of rules (regexp operations)
for matching log lines
○ A log entry that doesn't match a plugin has failed against ALL of
its rules!
○ Reduce the number of rules: use a [translation] table
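The idea behind the translation table can be illustrated in plain Python. This is not the Agent's configuration syntax: `TRANSLATION`, `RULE`, `parse` and the event IDs are hypothetical stand-ins showing one rule plus a lookup table replacing three near-identical rules.

```python
import re

# Hypothetical lookup table: captured action text -> event sub-ID
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}

# One rule captures the action instead of one rule per action
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src>\S+)')

def parse(line):
    """Return the translated event for a line, or None when the rule fails."""
    found = RULE.search(line)
    if found is None:
        return None
    return {'sid': TRANSLATION[found.group('action')],
            'src': found.group('src')}

event = parse('TRAFFIC_DROP src=10.0.0.1')
miss = parse('unrelated log line')
```

A non-matching line now costs one regex evaluation instead of three.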
Performance Strategies
Writing Agent plugins
● Rules are matched in alphabetical order
○ Order your rules by priority: pattern most likely to
match first

● Divide and conquer
○ A plugin is configured to read from one source file: use
a dedicated source file per technology
○ Also, use a dedicated plugin for each technology
Performance Strategies
Tool1, Tool2, Tool3, Tool4, Tool5 (20 logs/sec each) -> /var/log/syslog (100 logs/sec)

5 plugins with 1 rule each, all reading /var/log/syslog:
5 x 100 = 500 total regex/sec
Performance Strategies
Tool1, Tool2, Tool3, Tool4, Tool5 (20 logs/sec each) -> /var/log/tool1, ..., /var/log/tool5 (100 logs/sec in total)

5 plugins with 1 rule each, each reading its own /var/log/tool{1-5}:
5 x 20 = 100 total regex/sec (5x faster)
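The arithmetic behind the two layouts, as a tiny sketch. The helper name and its parameters are made up; the figures are the ones from the slides.

```python
def regex_evaluations_per_sec(plugins, rules_per_plugin, lines_per_plugin):
    """Every plugin runs each of its rules on every line it reads."""
    return plugins * rules_per_plugin * lines_per_plugin

# Shared file: each of the 5 plugins reads all 100 lines/sec
shared = regex_evaluations_per_sec(5, 1, 100)     # 500

# Dedicated files: each plugin reads only its own 20 lines/sec
dedicated = regex_evaluations_per_sec(5, 1, 20)   # 100

speedup = shared // dedicated                     # 5
```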
Tools for testing Regex
Python:
>>> import re
>>> re.findall(r'(\S+) (\S+)', 'foo bar')
[('foo', 'bar')]
>>> result = re.search(
...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
...     'foo=bar'
... )
>>> result.groupdict()
{'key': 'foo', 'value': 'bar'}
Tools for testing Regex
Regex debuggers:
● Kiki
● Kodos
Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)
Online regex visualization:
● http://www.regexper.com/ (javascript)
any (?:question|doubt|comment)+?
A3Sec
web: www.a3sec.com
email: training@a3sec.com
twitter: @a3sec
Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78
México Head Office
Avda. Paseo de la Reforma, 389 Piso 10
México DF
Tlf. +52 55 5980 3547
