A3 sec -_regular_expressions

Regular Expressions
Performance
Optimizing event capture building
better Ossim Agent plugins

About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● AlienVault's Authorized Training Center (ATC)
for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec

About Me
● David Gil <dgil@a3sec.com>
● Developer, Sysadmin, Project Manager
● Really believes in Open Source model
● Programming since he was 9 years old
● Ossim developer at its early stage
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi books reader and mountain bike rider

Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools

Regular Expressions
What is a regex?
Regular expression:

(bb|[^b]{2})\d\d
Input strings:

bb445, 2ac3357bb, bb3aa2c7,
a2ab64b, abb83fh6l3hi22ui
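
The example above can be tried in an interactive session. This is a minimal sketch, assuming Python 3 and assuming the regex on the slide ends in two digits (`\d\d`); the names `pattern`, `inputs` and `results` are chosen here for illustration.

```python
import re

# "bb" or any two characters that are not "b", followed by two digits.
pattern = re.compile(r'(bb|[^b]{2})\d\d')

inputs = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']

# Map each input to its first matching substring, or None if nothing matches.
results = {}
for text in inputs:
    m = pattern.search(text)
    results[text] = m.group(0) if m else None

for text, matched in results.items():
    print(text, '->', matched)
```

Three of the five strings match (`bb44`, `ac33`, `bb83`); the other two contain no position where the whole pattern fits.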

Regular Expressions
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve

● Python string functions are small C loops:
super fast!
○ startswith(), endswith(), split(), etc.

● Use standard parsing libraries!
Formats: JSON, HTML, XML, CSV, etc.

Regular Expressions
To RE or not to RE
Example: URL parsing
● regex:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$

● parse_url() php method:
$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
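
The same idea works in Python. A minimal sketch, assuming Python 3, where the standard library function `urllib.parse.urlsplit` plays the role that `parse_url()` plays in PHP:

```python
from urllib.parse import urlsplit

url = "http://username:password@hostname/path?arg=value#anchor"
parts = urlsplit(url)

# Every component is an attribute of the result; no regex is involved.
print(parts.scheme)    # http
print(parts.hostname)  # hostname
print(parts.username)  # username
print(parts.password)  # password
print(parts.path)      # /path
print(parts.query)     # arg=value
print(parts.fragment)  # anchor
```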

Regular Expressions
To RE or not to RE
But, there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?

Regular Expressions
Basics - Characters
● \d, \D: digit / non-digit. \w, \W: word / non-word character. \s, \S: whitespace / non-whitespace
>>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
>>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')

● ^, $: Begin/End of string
>>> re.findall(r'(\d+)', 'cba3456csw')
>>> re.findall(r'^(\d+)$', 'cba3456csw')

● . (dot): Any character:
>>> re.findall(r'foo(.)bar', 'foo=bar')
>>> re.findall(r'(...)=(...)', 'foo=bar')
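
For reference, this is what those calls return. A small sketch, assuming Python 3 and assuming the character classes are the backslash forms (`\d`, `\s`, `\S`); the variable names are illustrative only.

```python
import re

month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')   # ['07']
pair = re.findall(r'(\S+)\s+(\S+)', 'foo bar')              # [('foo', 'bar')]

digits = re.findall(r'(\d+)', 'cba3456csw')                 # ['3456']
anchored = re.findall(r'^(\d+)$', 'cba3456csw')             # [] (string is not all digits)

sep = re.findall(r'foo(.)bar', 'foo=bar')                   # ['=']
sides = re.findall(r'(...)=(...)', 'foo=bar')               # [('foo', 'bar')]
```

With one group in the pattern, `re.findall` returns the group's text; with several groups it returns tuples.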

Regular Expressions
Basics - Repetitions
● *, +: 0 or more, 1 or more repetitions
>>> re.findall(r'FO+', 'FOOOOOOOOO')
>>> re.findall(r'BA*R', 'BR')

● ?: 0 or 1 repetitions
>>> re.findall(r'colou?r', 'color')
>>> re.findall(r'colou?r', 'colour')

● {n}, {n,m}: exactly n, or between n and m repetitions:
>>> re.findall(r'\d{2}', '2013-07-21')
>>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
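
The results of the repetition examples, as a quick sketch (Python 3 assumed, `\d` and `\.` assumed for the digit and literal-dot forms; variable names are illustrative):

```python
import re

shout = 'FOOOOOOOOO'
plus = re.findall(r'FO+', shout)            # the whole string: + takes every O
star = re.findall(r'BA*R', 'BR')            # ['BR'], zero A's is allowed

us = re.findall(r'colou?r', 'color')        # ['color']
uk = re.findall(r'colou?r', 'colour')       # ['colour']

pairs = re.findall(r'\d{2}', '2013-07-21')  # ['20', '13', '07', '21']
ip = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')  # ['192.168.1.25']
```

Note that the dots in the IP pattern are escaped; an unescaped `.` would accept any character between the numbers.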

Regular Expressions
Basics - Groups
[...]: Set of characters
>>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')

...|...: Alternation
>>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

(...) and \1, \2, ...: Group and backreference
>>> re.findall(r'(\w+)=(\1)', 'foo=bar')
>>> re.findall(r'(\w+)=(\1)', 'foo=foo')

(?P<name>...): Named group
>>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
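
A sketch of what the group examples return, assuming Python 3 and assuming the backreference is written `\1` and the closing parenthesis of the named group sits inside the pattern string. Variable names are illustrative.

```python
import re

chars = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')        # ['foo=bar']
alt = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')    # [('foo', 'bar')]

# \1 must repeat exactly what group 1 captured.
no_repeat = re.findall(r'(\w+)=(\1)', 'foo=bar')       # []
repeat = re.findall(r'(\w+)=(\1)', 'foo=foo')          # [('foo', 'foo')]

# A named group can be read back by name from the match object.
m = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
day = m.group('day')                                   # '23'
```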

Regular Expressions
Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']

● An overall match takes precedence over an
overall non-match
>>> re.findall(r'<.*>.*</.*>', '<b>i am bold</b>')
>>> re.findall(r'<(.*)>.*</(.*)>', '<b>i am bold</b>')

● Minimal matching, non-greedy
>>> re.findall(r'<(.*)>.*', '<b>i am bold</b>')
>>> re.findall(r'<(.*?)>.*', '<b>i am bold</b>')
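
The greedy and lazy behaviour can be seen side by side. This is a sketch assuming Python 3 and assuming the input string on the slides was `<b>i am bold</b>` (the tags appear to have been lost from the text version); variable names are illustrative.

```python
import re

html = '<b>i am bold</b>'

# Greedy .* gives back characters until the whole pattern can match.
whole = re.findall(r'<.*>.*</.*>', html)        # ['<b>i am bold</b>']
tags = re.findall(r'<(.*)>.*</(.*)>', html)     # [('b', 'b')]

# With nothing after the group to force backtracking, greedy takes too much.
greedy = re.findall(r'<(.*)>.*', html)          # ['b>i am bold</b']
lazy = re.findall(r'<(.*?)>.*', html)           # ['b']
```

In `tags` the first group ends up as `b` even though `.*` is greedy, because any longer capture would leave no `</` for the rest of the pattern.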

Regular Expressions
Performance Tests
Different implementations of a custom
is_a_word() function:
● #1 Regexp
● #2 Char iteration
● #3 String functions

Regular Expressions
Performance Test #1
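import re
import string

def is_a_word(word):
    # ASCII letters only (string.uppercase/lowercase exist only in Python 2)
    CHARS = string.ascii_uppercase + string.ascii_lowercase
    regexp = r'^[%s]+$' % CHARS
    return "YES" if re.search(regexp, word) else "NOP"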

Regular Expressions
Performance Test #1
timeit.timeit(s, 'is_a_word(%s)' %(w))

Time (s)       Result  Length  Input
1.49650502205  YES     4       word
1.65614509583  YES     25      wordlongerthanpreviousone..
1.92520785332  YES     60      wordlongerthanpreviosoneplusan..
2.38850092888  YES     120
1.55924701691  NOP     10      not a word
1.7087020874   NOP     25      not a word, just a phrase..
1.92521882057  NOP     50      not a word, just a phrase bigg..
2.39075493813  NOP     102

The longer the target string, the slower the regex
matching, whether it succeeds or fails.

Regular Expressions
Performance Test #2
def is_a_word(word):
    CHARS = string.ascii_uppercase + string.ascii_lowercase
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

Regular Expressions
Performance Test #2
Time (s)        Result  Length  Input
0.687522172928  YES     4       word
1.0725839138    YES     25
2.34717106819   YES     60
4.31543898582   YES     120
0.54797577858   NOP     10      not a word
0.547253847122  NOP     25
0.546499967575  NOP     50
0.553755998611  NOP     102

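On success: two nested loops (every character of the word is checked against CHARS), which is slow.
On failure: it always stops at the same point, and so takes the same time (the first space).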

Regular Expressions
Performance Test #3
def is_a_word(word):
    return "YES" if word.isalpha() else "NOP"

Regular Expressions
Performance Test #3

Time (s)        Result  Length  Input
0.146447896957  YES     4       word
0.212563037872  YES     25
0.318686008453  YES     60
0.493942975998  YES     120
0.14647102356   NOP     10      not a word
0.146160840988  NOP     25
0.147103071213  NOP     50
0.146239995956  NOP     102

Python string functions (fast and small C loops)
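
To repeat the comparison, something like the following harness could be used. It is only a sketch under stated assumptions: Python 3, hypothetical function names (`is_a_word_regex`, `is_a_word_loop`, `is_a_word_isalpha`), made-up sample strings, and a precompiled regex, which the slides do not use. Absolute timings will differ from the figures above.

```python
import re
import string
import timeit

CHARS = string.ascii_uppercase + string.ascii_lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)

def is_a_word_regex(word):
    # Test #1: regular expression
    return "YES" if WORD_RE.search(word) else "NOP"

def is_a_word_loop(word):
    # Test #2: character iteration
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

def is_a_word_isalpha(word):
    # Test #3: built-in string method
    return "YES" if word.isalpha() else "NOP"

IMPLEMENTATIONS = [is_a_word_regex, is_a_word_loop, is_a_word_isalpha]
SAMPLES = ['word', 'w' * 120, 'not a word', 'not a word, just a phrase']

def benchmark(number=20000):
    # Time each implementation on each sample string.
    for func in IMPLEMENTATIONS:
        for sample in SAMPLES:
            elapsed = timeit.timeit(lambda: func(sample), number=number)
            print('%-18s len=%-4d %s %.4f s'
                  % (func.__name__, len(sample), func(sample), elapsed))

if __name__ == '__main__':
    benchmark()
```

One caveat: `str.isalpha()` also accepts non-ASCII letters, so the third variant is not strictly equivalent to the other two outside ASCII input.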

Regular Expressions
Performance Strategies
Writing regex
● Be careful with repetitions (+, *, {n,m})
(abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?

(abc|def){2,1000} produces ...

Regular Expressions
Writing regex

● Be careful with wildcards
re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef') # slower
re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef') # faster

● Longer target string -> slower regex
matching
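
The effect of wildcards can be measured directly. A rough sketch, assuming Python 3 and assuming the faster pattern uses `\s` between the fields; `long_line` is a made-up log line with a long tail, and the timings printed depend on the machine.

```python
import re
import timeit

loose = re.compile(r'(ab).*(cd).*(ef)')
strict = re.compile(r'(ab)\s(cd)\s(ef)')

line = 'ab cd ef'
# Both patterns extract the same fields from the short sample.
same = loose.findall(line) == strict.findall(line) == [('ab', 'cd', 'ef')]

# With a long tail, each greedy .* first runs to the end of the line
# and then has to backtrack to find the next literal.
long_line = 'ab cd ef' + ' padding' * 200

for name, rx in (('loose', loose), ('strict', strict)):
    elapsed = timeit.timeit(lambda: rx.search(long_line), number=10000)
    print(name, elapsed)
```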

Regular Expressions
Writing regex
● Use a non-capturing group when you do not need
to capture the text and save it to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)
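
A short sketch of the difference, assuming Python 3; the input string and variable names are invented for the example.

```python
import re

text = 'abc=1 def=2 xyz=3'

# Capturing: the alternation is saved as group 1 even if nobody reads it.
capturing = re.findall(r'(abc|def|ghi)=(\d)', text)        # [('abc', '1'), ('def', '2')]

# Non-capturing: same matches, only the value is kept.
non_capturing = re.findall(r'(?:abc|def|ghi)=(\d)', text)  # ['1', '2']
```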

Regular Expressions
Writing regex

● Pattern most likely to match first
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY) # common prefix factored out of the alternation

● Use anchors (^ and $) to limit the scope
re.findall(r'(ab){2}', 'abcabcabc')
re.findall(r'^(ab){2}', 'abcabcabc') # failures occur faster
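
The last two points can be illustrated as follows. This is a sketch assuming Python 3; the log line and the second test string are hypothetical, and the groups are written as non-capturing so that `findall` returns the full match.

```python
import re

# Factoring the common prefix out of the alternation: same lines match,
# but the group now holds only the part that varies.
log = 'fw: TRAFFIC_DROP src=10.0.0.1'   # made-up log line
full = re.search(r'(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)', log).group(1)
short = re.search(r'TRAFFIC_(ALLOW|DROP|DENY)', log).group(1)

# Without an anchor the engine retries at every position of the string;
# with ^ it gives up after the first position.
unanchored = re.findall(r'(?:ab){2}', 'ababcabab')       # ['abab', 'abab']
anchored = re.findall(r'^(?:ab){2}', 'ababcabab')        # ['abab']
miss_unanchored = re.findall(r'(?:ab){2}', 'abcabcabc')  # []
miss_anchored = re.findall(r'^(?:ab){2}', 'abcabcabc')   # []
```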

Regular Expressions
Writing Agent plugins
● A new process is forked for each loaded
plugin
○ Use the plugins that you really need!

● A plugin is a set of rules (regexp operations)
for matching log lines
○ If a plugin doesn't match a log entry, it fails in ALL its
rules!
○ Reduce the number of rules, use a [translation] table
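
The idea behind a translation table can be sketched in plain Python. This is not the Ossim agent's implementation or configuration syntax; `TRANSLATION`, `RULE`, `parse` and the log format are all hypothetical names made up for the illustration. One rule captures the variable part, and a lookup table maps it to an event id, instead of one rule per event type.

```python
import re

# Hypothetical table: captured action -> numeric event id.
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}

# A single rule covers all three event types.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src>\S+)')

def parse(line):
    """Return the translated event for a log line, or None if it does not match."""
    m = RULE.search(line)
    if m is None:
        return None
    return {'sid': TRANSLATION[m.group('action')], 'src': m.group('src')}

print(parse('TRAFFIC_DROP src=10.0.0.1'))
print(parse('unrelated log line'))
```

A non-matching line now costs one failed regex instead of three.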

Regular Expressions
● Rules are matched in alphabetical order
○ Order your rules by priority, pattern most likely to
match first

● Divide and conquer
○ A plugin is configured to read from a source file, use
dedicated source files per technology
○ Also, use dedicated plugins for each technology

Regular Expressions
Tool1 to Tool5 (20 logs/sec each) -> /var/log/syslog (100 logs/sec)

5 plugins with 1 rule each, all reading /var/log/syslog
5 x 100 = 500 total regex/sec

Regular Expressions
Tool1 to Tool5 (20 logs/sec each) -> /var/log/tool1 to /var/log/tool5 (100 logs/sec in total)

5 plugins with 1 rule each, each reading its own /var/log/tool{1-5}
5 x 20 = 100 total regex/sec (5x faster)
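
The arithmetic behind the two layouts, as a small sketch. It assumes the worst case in which every line of a source file is tried against every rule of every plugin reading that file; the function name is made up.

```python
def regex_evals_per_sec(plugins):
    """plugins: list of (rules_in_plugin, lines_per_sec_in_its_source_file)."""
    return sum(rules * rate for rules, rate in plugins)

# Five one-rule plugins all reading the shared syslog (100 lines/sec).
shared = regex_evals_per_sec([(1, 100)] * 5)

# Five one-rule plugins, each reading its own file (20 lines/sec).
dedicated = regex_evals_per_sec([(1, 20)] * 5)

print(shared, dedicated, shared // dedicated)
```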

Regular Expressions
Tools for testing Regex
Python:
>>> import re
>>> re.findall(r'(\S+) (\S+)', 'foo bar')
[('foo', 'bar')]
>>> result = re.search(
...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
...     'foo=bar'
... )
>>> result.groupdict()
{'key': 'foo', 'value': 'bar'}

Regular Expressions
Tools for testing Regex
Regex debuggers:
● Kiki
● Kodos
Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)
Online regex visualization:
● http://www.regexper.com/ (javascript)

any (?:question|doubt|comment)+?

A3Sec
web: www.a3sec.com
email: training@a3sec.com
twitter: @a3sec
Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78
México Head Office
Avda. Paseo de la Reforma, 389 Piso 10
México DF
Tlf. +52 55 5980 3547
