rspamd-fosdem

Why Rspamd?
SpamAssassin → rspamd in “Rambler” Mail

Rspamd in action
Real-time stats

Rspamd in a nutshell
• Written in plain C
• Uses non-blocking (event-driven) data processing
model
• Oriented on performance
• Plugins and rules are written in Lua
• Supports all major ﬁltering methods
• Packaged for the most of *nix systems

Major spam classes
• Fraud and scam:
• “Nigerian” mail
• Phishing (banks, social networks)
• Advertisement:
• Plain text
• Pictures, obfuscated HTML etc
• Spam from the social networks notiﬁcations

Filtering methods
Basic principles
• Policies (particularly efﬁcient against fraud)
• DNS lists (for IP and URLs)
• SPF, DKIM, DMARC
• Sender’s reputation
• Static content ﬁltering (patterns match)
• Statistics and ML (e.g. Bayesian)

Filtering methods
Difﬁculties
• Policies:
• Require network requests with unpredictable delay
• False positive (e.g. when forwarding)
• Static Content ﬁltering:
• CPU intensive
• Unlikely to be useful long-term
• Statistics:
• Not very precise
• Hard to adopt for multi-user systems

Rspamd special features
• Text lemmatisation and normalisation (snowball stemmer)
• Rspamd converts all texts to utf-8
• Support for users settings in JSON
• Possibility to have multiple metrics for a message
• Web interface
• Rspamd can load SpamAssassin rules directly
• OpenSource (BSD 2-clause)

Rspamd integration
Using Rmilter
MTA Rmilter Rspamd
Storage
Message and SMTP DATA HTTP request
Redis cache
JSON: action, symbolsDelay, modify or reject
Greylisting, limits Statistics, shared cache
✉✉✉✉
✉ ✉
🚮
Inbox Spam
✉
Reject
✉

Rspamd integration
milter
• rmilter + postﬁx/sendmail:
• supports all rspamd features (e.g. greylisting
action)
• can split inbound and outbound messages
• can whitelist reply to own messages
• can add DKIM signatures to outgoing email

Rspamd integration
Other MTA
• exim - supports rspamd internally (from 4.86)
• haraka - supports rspamd internally
• any other MTA - LDA mode when rspamc checks
message on rspamd and calls delivery agent with
altered message using Unix pipes

Development principles
• Performance and scalability
• Filtering quality and extensions
• Security
• Simplicity and documentation coverage

Rules in Rspamd
Pre-ﬁlters Filters Post-ﬁlters
Rule Rule Rule Rule Rule Rule Rule Rule Rule Rule
Policy rules Content rules Statistics rules
Settings, ratelimit Main scan Postprocessing

Message lifecycle
RulesRules
Filters
RulesRules
Pre-ﬁlters
RulesRules
Post-ﬁlters
📩 Message
📬 Result
Wait
Wait
Wait for dependencies

Event driven model
Never blocks*
• Advantages:
✅ Can process network replies asynchronously
✅ Sends all network request simultaneously
✅ Can process many messages in parallel, scales well
• Disadvantages:
📛 Callbacks hell
⛔ Memory consumption can go wild
*almost never

Sequential processing
Traditional approach
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait Wait
Processing

Event-driven processing
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait
Processing

Rule 1
Rule 1
Rules
DNS Hashes
Wait
What really happens?

• Rspamd can send thousands of DNS requests:  
time: 5540.8ms real, 2427.4ms virtual, dns req: 120543

• Rspamd can send thousands of DNS requests:  
time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
• For short messages waiting prevails on the
processing time:  
time: 996.140ms real, 22.000ms virtual,

Statistics in rspamd
• Uses Markov chains for statistic tokens (using
sparse bi-gramms model)
• Considers metadata in statistics: mime structure,
headers and so on
• Statistics could be stored in redis or sqlite3
• Rspamd supports per-user, per-domain statistics,
multiple classiﬁers and autolearning

Tokens generation
Quick brown fox jumps over lazy dog
1
2
3
4
1
2

Statistics example
Hard case: images spam
0%
25%
50%
75%
100%
Spam Ham
8%5%
92%95%
Correctly recognized Unrecognized

Fuzzy hashes
Introduction
• Can ﬁnd similar or close to similar messages
• Rspamd uses fuzzy logic for text and exact
hashes for attachments
• Storage engine - sqlite3

Shingles Algorithm
Quick brown fox jumps over lazy dog
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
N hashes
h1
h2
h3
h4
h1’
h2’
h3’
h4’

Shingles Algorithm
h1 h2 h3
h1’ h2’ h3’
h1’’ h2’’ h3’’
h1’’’’’ h2’’’’ h3’’’’
…
…
…
…
…
min
min
min
min
…
N shinglesN hash pipes

Shingles Algorithm
• Probabilistic in general
• Uses siphash with different keys
• Window size - 3 words, number of hashes - 32

Main Principles
• Always consider performance.
• If something could be skipped, skip it
• Use more speciﬁc solution if it is faster than
generic one (special FSM, asm fragments)
• If it is possible to do something fast but not 100%
accurate, try faster method

Rules Optimisation
Global optimisation
• Stop checks if a message has reached threshold
• Prioritise rules with negative weight
• Prioritise faster and more frequent rules:
• Check weight, hit probability and average time
• Use greedy sorting algorithm

Rules Optimisation
Local optimisations
• Build Abstract Syntax Tree for each rule
• Tree is periodically resorted in respect of hit
probability and execution time
• Regular expressions are optimised using PCRE
+JIT and Hyperscan (from 1.1)
• Lua code is optimised by means of LuaJIT

AST Optimisations
Branches cut
&
|C
! B
A
A = 0, B = 1, C = 0
0
1
10
1
0
Order of processing
4/6 branches skipped
(!A or B) and C

AST optimisations
N-ary operations
>
+ 2
! B
A
Check order
C D E
What we compare?
Here is our limit
Stop processing
(!A+B+C+D+E) > 2

C Library optimisations
• Accelerated base32/base64
• Optimised string functions
• Fixed size strings
• Own printf implementation
• Own logger layer
• Assembly versions for crypto ciphers and hashes

Hyperscan in Rspamd
Multi-regexp match engine

Hyperscan in Rspamd
Problem statement
• Need to match many regular expressions for the
same data:
len: 610591, time: 2492.457ms real, 882.251ms virtual
regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre
• No need to capture patterns
• Need to match both UTF8 and raw data
• Need to support Perl compatible syntax

Naive solution
• Join all expressions with ‘|’:
/abc/ + /cde/ => /(?:abc)|(?:cde)/
• Won’t work:
/abc/ + /a/ => /(?:abc)|(?:a)/ -> the second rule won’t match for ‘abc’

Hyperscan
Text
/re1/
/re2/
/re3/
/re4/
/re5/
/re6/
/re7/
Scan
Match
Match
Match

Rules and Hyperscan
From
Subject
Text Part
HTML Part
Attachment
Header RE class
Header RE class
Text Parts class
Raw Body class
Whole message
m1
m2
m3
m4
m5
m6
MIME message Hyperscan regexps

Problems with Hyperscan
• Slow compile time (like 100 times slower than
PCRE+JIT)
• Many PCRE features are unsupported (lookahead,
look-behind)
• Is not packaged for the major OSes (requires
relatively fresh boost)

Lazy hyperscan compile
HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Use PCRE only
Compile to disk

HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Use PCRE only
All done Broadcast message

HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Switch to hyperscan
Loaded from cache
Wait for ‘recompile’ command

• Does not affect the starting time
• If nothing changes then scanners load pre-
compiled expressions on startup (using crypto
hashes)
• Recompile command is used to force
recompilation process:
rspamadm recompile

Unsupported features
Pre-ﬁlter mode
Text
/re1/pre
/re2/
/re3/pre
/re4/
/re5/
/re6/
/re7/
Scan
Maybe match?
Match
Match
Check
PCRE
Match!
Replaces unsupported constructs with "weaker"
supported ones, guaranteeing a match superset:

Hyperscan results
Use Hyperscan for the vast majority of RE rules:
regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre
regexp statistics: 34 pcre regexps scanned, 41 regexps matched, 8.41M bytes scanned using pcre,
9.56M bytes scanned total

Main Principles
• Use secure C programming guidelines
• Perform static analysis for the code including
special clang plugin for own printf format safety
• Deal with malicious messages
• Limit and protect DNS subsystem
• Encrypt and protect all data passed over the
network

Examples
• An email can contain thousands of objects to
check (e.g. URLs)
• SPF records could have inﬁnite recursion
• Need limits for network checks and memoization
of the results

DNS protection
• Cryptographic DNS id generation
• Socket pool with constant rotation
• Source port randomisation
• Input ﬁltering (+ IDN encoding)

Trafﬁc Encryption
• Security -> performance -> simplicity
• TLS is hard in non-blocking IO
• TLS does a lot of unnecessary (in case of
rspamd) operations increasing thus CPU load and
latency

Connection Establishment
ClientHelo
ServerHelo
Certiﬁcates⚠
ClientChangeCipher
ServerChangeCipher
Data
TLS handshake
ClientPK
Encrypted request
Encrypted reply
Validation (optional)
HTTPCrypt handshake

Connection Latency
Cached connections

Connection Latency
New connections

Why HTTPCrypt is fast
• Faster key exchange (ECDH)
• Server does not need to sign anything 
(needed for replay protection in TLS)
• Encryption is done in-place
• No replay protection for the ﬁrst message - still
suitable for idempotent scanning

Conclusions
• Rspamd is designed to work as fast as possible
• Rspamd has a rich and actively maintained
collection of rules and plugins and can re-use
existing SA rules
• It is possible to encrypt all data transferred over
the network in rspamd ecosystem

Open issues
• No Debian maintainer (any volunteers?)
• No dynamic rules updates (planned for 1.2) yet
• Rapid development (hard to keep compatibility)

Questions?
Vsevolod Stakhov
https://rspamd.com

Rspamd processes
✉
Scanning processes
Scanning processes
Scanner processes Controller Service processes
Main process
✉✉
HTTP
📬 ✉✉✉
Learn
Messages Results

Conﬁguration evolution
1. Grammar parser (lex + yacc)
⛔ Hard to manage
⛔ Hard to extend
2. XML
⛔ Unreadable
⛔ Problems with expressions (A > B)
3. UCL - universal conﬁguration language
✅ Easy to manage (looks like nginx.conf)
✅ Macro support
✅ JSON data model (can be used as JSON parser)

UCL building blocks
• Sections
• Arrays
• Variables
• Macros
• Comments
section {
key = “value”;
number = 10K;
}
upstreams = [
“localhost:80”,
“example.com:8080”,
]
static_dir = “${WWWDIR}/“;
filepath = “${CURDIR}/data”;
.include “${CONFDIR}/workers.conf”
.include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf”
.lua { print(“hey!”); }
key = value; // Single line comment
/* Multiline comment
/* can also be nested */
*/

Configuration components
Global options
Workers configuration
Scores (metrics)
Modules configuration
Statistics
• Each component is normally included
to the main configuration
• rspamd.local.conf is used to extend
configuration
• rspamd.override.conf is used to
override values in the configuration
• It is possible to use numeric
multipliers: “k/m/g” or “ms/s/m/h/d” for
time values

Lua rules
• Most of rules are deﬁned in Lua conﬁguration
• Two types of Lua rules:
• Regexp rules (look like strings)
• Lua functions (pure Lua code)

Lua rules
Some examples
-- Outlook versions that should be excluded from summary rule
local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H'
local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H'
-- Summary rule for forged outlook
reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s',
forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid)
• Regexp rule
• Lua rule
rspamd_config.R_EMPTY_IMAGE = function(task)
local tp = task:get_text_parts() -- get text parts in a message
for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
if p:is_html() then -- if the current part is html part
local hc = p:get_html() -- we get HTML context
local len = p:get_length() -- and part's length
if len < 50 then -- if we have a part that has less than 50 bytes of text
local images = hc:get_images() -- then we check for HTML images
if images then -- if there are images
for _,i in ipairs(images) do -- then iterate over images in the part
if i['height'] + i['width'] >= 400 then -- if we have a large image
return true -- add symbol
end
end
end
end
end
end
end

Pure LUA functions
Review
• Are very powerful
• Have access to all information from rspamd via lua
API: https://rspamd.com/doc/lua/
• Are very fast since C <-> LUA interaction is cheap
• Can use zero-copy objects called rspamd{text} to
avoid copying when moving data between C and
LUA

Pure LUA functions
• Variables:
• Conditionals:
• Loops:
• Tables:
• Functions:
• Closures:
local ret = false -- Generic variable
local rules = {} -- Empty table
local rspamd_logger = require “rspamd_logger" -- Load rspamd module
if not ret then -- can use ‘not’, ‘and’, ‘or’ here
…
elseif ret ~= 10 then -- note ~= for ‘not equal’ operator
end
for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value
for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value
for i=1,10 do … end -- Count from 1 to 10
local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1
another_key = function(task) … end, -- Functions can be values
[2] = {} -- Other tables can be values
} -- Can have both numbers and strings as key and anything as values
local function something(task) -- Normal definition
local cb = function(data) -- Functions can be nested
…
end
end
local function gen_closure(option)
local ret = false -- Local variable
return function(task)
task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here
end
end
rspamd_config.SYMBOL = gen_closure(‘some_option’)

Pure LUA functions
Generic recommendations
• Use local whenever possible (otherwise, global
variables are expensive)
• Callbacks, closures and recursion are generally cheap
(when using LuaJIT)
• Do not mix string and number keys in tables, that
makes them hard to iterate
• ipairs and pairs are not equal
• Strings are constant in LUA

Regexp rules
Types
• Can work with the following elements:
• Headers: Message-Id=/^something$/H
• Mime parts: /some word/P
• Raw messages: /some pattern/M
• URLs: /example.com/U
• Some new ﬂags are added:
• UTF8 ﬂag: /u

Regexp rules
Generic information
• Can be combined using the following operators:
• AND: /something/P && Subject=/some/H
• OR: /something/P || Subject=/some/H
• NOT: !/something/P
• PLUS: /A/P + /B/P + /C/P >= 2
• Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS
• Braces can change priority: !A AND (B OR C)

Regexp rules
Performance considerations
• Avoid message regexps at any cost (use trie
instead)
• Regexp expressions are highly optimised in rspamd
and unnecessary evaluations are not performed
• UTF regexps are more expensive than default ones
(but could be useful sometimes)
• Always use the appropriate type of expression (e.g.
url for links and part for textual content)

Trie matching
• Perfect for fast raw message and text pattern
matching
• Scales almost linearly with input size (aho-corasic
algorithm)
• Can handle thousands and hundreds of thousands
of patterns (is a base for all antivirus scanners)
• Highly optimised for 64 bit systems

rspamd-fosdem

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (11)

Similar to rspamd-fosdem

Similar to rspamd-fosdem (20)

rspamd-fosdem