SlideShare a Scribd company logo
1 of 65
Download to read offline
Vsevolod Stakhov
https://rspamd.com
Why rspamd?
A real example
Rspamd in nutshell
• Uses multiple rules to evaluate messages scores
• Is written in C
• Uses event driven processing model
• Supports plugins in LUA
• Has self-contained management web interface
Design goals
• Orientation on the mass mail processing
• Performance is the cornerstone of the whole
project
• State-of-art techniques to filter spam
• Prefer dynamic filters (statistics, hashes, DNS lists
and so on) to static ones (plain regexp)
Part I: Architecture
Event driven processing
Never blocks*
• Pros:
✅ Can process rules while waiting for network services
✅ Can send all network requests simultaneously
✅ Can handle multiple messages within the same process
• Cons:
📛 Callbacks hell (hard development)
⛔ Hard to limit memory usage due to unlimited concurrency
*almost all the time
Sequential processing
Traditional approach
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait Wait
Timeline
Event driven model
Rspamd approach
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait
Timeline
Rule 1
Rule 1
Rules
DNS Hashes
Wait
Event driven model
What happens in the real life
Event driven model
Some measurements
• Rspamd can send hundred thousands of DNS
requests per second (RBL, URI blacklists, custom
DNS lists): time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
• For small messages (which are 99% of typical mail)
network processing is hundreds times more expensive
than direct processing: time: 996.140ms real, 22.000ms virtual,
• Event model scales very well allowing highest
possible concurrency level within a single process (no
locking is needed normally)
Real message processing
We need to go deeper
RulesRules
Filters
RulesRules
Pre-filters
RulesRules
Post-filters
📩 Message
📬 Result
Wait
Wait
Wait (dependencies)
Real message processing
We need to go deeper
• Pre filters are used to evaluate message or to
reject/accept it early (e.g. greylisting)
• Normal rules add scores (positive or negative)
• Post filters combine rules and adjust scores if
needed (e.g. composite rules)
• Normal rules can also depend on each other
(additional waiting)
Rspamd processes
Overview
Scanning processes
Scanning processes
Scaner processes Controller Service processes
Main process
✉✉✉
HTTP
📬 ✉✉✉
Learn
Messages Results
Main process
One to rule them all…
• Reads configuration
• Manages worker processes
• Listens on sockets
• Opens and reopen log files
• Handles dead workers
• Handles signals
• Reloads configuration
• Handle command line
Main process
⬇
Process ProcessProcess
Process
📃Logs
Listen sockets
📝Config
⬇ ⬇
Signals
Scanner process
• Scans messages and returns result
• Uses HTTP for operations
• Reply format is JSON
• Has SA compatibility protocol
Controller worker
• Provides data for web interface (acts as HTTP server
for AJAX requests and serving static files)
• Is used to learn statistics and fuzzy hashes
• Has 3 levels of access:
• Trusted IP addresses (both read and write)
• Normal password* (read commands)
• Enable password* (all commands)
* Passwords are encouraged to be stored encrypted using slow hash function
Service workers
• Are used by rspamd internally and usually have no external API
• The following types are defined:
• Fuzzy storage — stores fuzzy hashes and is learned from
the controller and accessed from scanners
• Lua worker — LUA application server
• SMTP proxy — SMTP balancing proxy with RBL filtering
• HTTP proxy — balancing HTTP proxy with encryption
support
Internal architecture
libservergmime
libucl
luajit
http-parser
📝Config📬Results✉✉✉
✉✉✉
pcre
aho-corasiclibrdns
Statistics architecture
Bayes operations
• Uses sparsed 5-gramms
• Uses messages’ metadata (User-Agent, some
specific headers)
• Uses inverse chi-square function to combine
probabilities
• Weights of the tokens are based on theirs positions
Statistics benchmarks
Hard cases (images spam)
Spam trigger
5%
95%
Spam symbol Not detected
Ham trigger
8%
92%
Ham symbol Not detected
Statistics architecture
Bayes tokenisation
Quick brown fox jumps over lazy dog
1
2
3
4
1
2
Statistics architecture
Statistics architecture
Classifier
Tokeniser
Statfile (class)
Statfile (class)
Statfile (class)
Backend Classification
Normalised words
(utf8 + stemming)
Spam probability
Tokens
Weights
Fuzzy hashes
Overview
• Are used to match, not to classify a message
• Combine exact hashes (e.g. for images or
attachments) with shingles fuzzy match for text
• Use sqlite3 for storage
• Expire hashes slowly
• Write to all storages, read from random one
Fuzzy hashes
Shingles algorithm
Quick brown fox jumps over lazy dog
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
N hashes
h1
h2
h3
h4
h1’
h2’
h3’
h4’
Fuzzy hashes
Shingles algorithm
h1 h2 h3
h1’ h2’ h3’
h1’’ h2’’ h3’’
h1’’’’’ h2’’’’ h3’’’’
…
…
…
…
…
min
min
min
min
…
N shinglesN hash pipes
Fuzzy hashes
Shingles algorithm
• Probabilistic algorithm (due to min hash)
• Use sliding window for matching words
• N siphash contexts with derived keys
• Derive subkeys using blake2 function
• Current settings: window size = 3, N = 32
Part II: Performance
Overview
• Rspamd is focused on performance
• No unnecessary rules are executed
• Memory is organised in memory pools
• All performance critical tasks are done by
specialised finite-state-machines
• Approximate match is performed if possible
Rules optimisation
Global optimisations
• Stop processing when rejection score is hit
• Process negative rules first to avoid FP errors
• Execute less expensive rules first:
• Evaluate rules average execution time, score and
frequency
• Apply greedy algorithm to reorder
• Resort periodically
Rules optimisation
Local optimisations
• Each rule is additional optimised using abstract
syntax tree (AST): 3-4 times speed up for large
messages
• Each rule is split and reordered using the similar
greedy algorithm
• Regular expressions are compiled using PCRE JIT
(from 50% to 150% speed up usually)
• Lua is optimised using LuaJIT
AST optimisations
Branches cut
&
|C
! B
A
A = 0, B = 1, C = 0
0
1
10
1
0
Eval order
• 4/6 branches skipped
AST optimisations
N-ary optimisations
>
+ 2
! B
A
Eval order
C D E
What do we compare?
Here is our limit
Stop here
Parsing FSM
• For the most of time consuming operations, rspamd uses special
finite-state machines:
• headers parsing;
• received headers parsing;
• protocol parsing;
• URI parsing;
• HTML parsing
• Prefer approximate matching, meaning extraction of the most
important information and skipping less important details
IP addresses storage
Traditional radix trie
IP1 IP2
01
01
01
01
Level per bit: 32 levels for IPv4
128 levels for IPv6
IP addresses storage
Prefix skipped radix trie
010
IP1 IP2
01
01
IP addresses storage
Prefix skipped radix trie
• Can efficiently compress IP prefixes
• Lookup is much faster due to lower trie depth
• IPv4 and IPv6 addresses can live within a single
trie
• Insertion is also faster
• Algorithm is much harder but extensively tested
Library optimisations
Logger interface
• Universal logger for files/syslog/console
• Filters non-ascii (or non-utf8 if enabled) symbols
• Allows skipping of repeated messages
• Can disable processing in case of throttling
• Can handle both privileged and non-privileged
reopening
Library optimisations
Printf interface
• Libc printf is slow and stupid
• Rspamd printf is inspired by nginx printf:
• Supports fixed integers (int64_t, uint32_t)
• Supports fixed length string (%v)
• Supports encoded strings and numbers (human-readable, hex
encoding, base64 and so on)
• Supports various backends: fixed size buffers, automatically growing
strings, files, console…
• Rspamd printf does not try to print input when output is overflowed (so it’s
impossible to force it to use CPU resources for ridiculously large strings)
Library optimisations
String operations
• Fast base64/base32 operations:
• alignment optimisations;
• use loop unwinding;
• use 64 bit integers instead of characters
• Fast lowercase:
• use the same optimisations for ASCII string
• approximate lowercase for UTF8 (not 100% correct but much faster)
• Fast lines counting: http://git.io/vYldq
Library optimisations
Generic tools
• Fast hash functions (xxhash and blake2)
• Fast encryption (using SIMD instructions if
possible)
• Use mmap when possible
• Align memory for faster operations
• Use google performance tools to find bottlenecks
Part III: Security
Main points
• Maintaining secure coding is hard for C:
• Prefer fixed length strings
• Avoid insecure functions
• Abort if malloc fails
• Assertions on bad input
• Testing (functional + unit testing)
• Main treats:
• Interaction with DNS
• Passive snooping of traffic
• Specially crafted messages
DNS security
• DNS is the major point to interact with the external
world
• There could be thousands requests per second
• DNS replies can be untrusted
• SPF records could be recursive
• DKIM records could be malformed
• Need local and global DNS requests limit
RDNS library
• Uses secure DNS ID generator based on crypto
permutation and entropy reseeding
• Uses sockets pool with time/usage expiration
• Randomises source port
• Carefully filters input data (+IDN encoding)
Transport encryption
• Designed to be fast, simple and secure
• TLS is too hard to manage in events based model
• Many functions of TLS are useless for rspamd
• TLS involves intermediate copying and significant
latency increase
HTTPCrypt in nutshell
Handshakes
ClientHelo
ServerHelo
Certificates⚠
ClientChangeCipher
ServerChangeCipher
Data
TLS handshake
ClientPK
Encrypted request
Encrypted reply
Validation (optional)
HTTPCrypt handshake
Performance
Throughput
HTTPCrypt HTTP+TLS (nginx)
Performance
Latency
HTTPCrypt HTTP+TLS (nginx)
Performance analysis
Why HTTPCrypt is fast
• For new sessions, HTTPCrypt uses curve25519 ECDH
which is almost twice faster than NIST P-256 ECDH
• There is no signing operation and no ECDSA
• For bulk encryption, there is no intermediate buffering like
in TLS - the payload is encrypted in-place
• Latency is reduced by skipping the full TLS handshake
• Large requests are somehow slow due to lack of chunked
encoding in HTTPCrypt implementation and some clever
tricks of data reading
Hashes security
• Hash tables are vulnerable for untrusted data:
• Rspamd randomly chooses hash tables seed at start
that is hard to predict
• XXHash is used for good speed and hash distribution
• Siphash is used for public hash tables (e.g. fuzzy
hashes)
• It’s hard to predict hash seed, hence it’s hard to
organise computational attack on hash tables
Part IV: Configuration
Configuration evolution
1. Grammar parser (lex + yacc)
⛔ Hard to manage
⛔ Hard to extend
2. XML
⛔ Unreadable
⛔ Problems with expressions (A > B)
3. UCL - universal configuration language
✅ Easy to manage (looks like nginx.conf)
✅ Macro support
✅ JSON data model (can be used as JSON parser)
UCL building blocks
• Sections
• Arrays
• Variables
• Macros
• Comments
section {
key = “value”;
number = 10K;
}
upstreams = [
“localhost:80”,
“example.com:8080”,
]
static_dir = “${WWWDIR}/“;
filepath = “${CURDIR}/data”;
.include “${CONFDIR}/workers.conf”
.include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf”
.lua { print(“hey!”); }
key = value; // Single line comment
/* Multiline comment
/* can also be nested */
*/
Configuration components
Global options
Workers configuration
Scores (metrics)
Modules configuration
Statistics
• Each component is normally included
to the main configuration
• rspamd.local.conf is used to extend
configuration
• rspamd.override.conf is used to
override values in the configuration
• It is possible to use numeric
multipliers: “k/m/g” or “ms/s/m/h/d” for
time values
Lua rules
• The most of rules are defined in LUA configuration
• Two types of LUA rules:
• Regexp rules (look like strings)
• Lua functions (pure LUA code)
Lua rules
Some examples
-- Outlook versions that should be excluded from summary rule
local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H'
local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H'
-- Summary rule for forged outlook
reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s',
forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid)
• Regexp rule
• Lua rule
rspamd_config.R_EMPTY_IMAGE = function(task)
local tp = task:get_text_parts() -- get text parts in a message
for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
if p:is_html() then -- if the current part is html part
local hc = p:get_html() -- we get HTML context
local len = p:get_length() -- and part's length
if len < 50 then -- if we have a part that has less than 50 bytes of text
local images = hc:get_images() -- then we check for HTML images
if images then -- if there are images
for _,i in ipairs(images) do -- then iterate over images in the part
if i['height'] + i['width'] >= 400 then -- if we have a large image
return true -- add symbol
end
end
end
end
end
end
end
Pure LUA functions
Review
• Are very powerful
• Have access to all information from rspamd via lua
API: https://rspamd.com/doc/lua/
• Are very fast since C <-> LUA interaction is cheap
• Can use zero-copy objects called rspamd{text} to
avoid copying when moving data between C and
LUA
Pure LUA functions
• Variables:
• Conditionals:
• Loops:
• Tables:
• Functions:
• Closures:
local ret = false -- Generic variable
local rules = {} -- Empty table
local rspamd_logger = require “rspamd_logger" -- Load rspamd module
if not ret then -- can use ‘not’, ‘and’, ‘or’ here
…
elseif ret ~= 10 then -- note ~= for ‘not equal’ operator
end
for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value
for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value
for i=1,10 do … end -- Count from 1 to 10
local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1
another_key = function(task) … end, -- Functions can be values
[2] = {} -- Other tables can be values
} -- Can have both numbers and strings as key and anything as values
local function something(task) -- Normal definition
local cb = function(data) -- Functions can be nested
…
end
end
local function gen_closure(option)
local ret = false -- Local variable
return function(task)
task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here
end
end
rspamd_config.SYMBOL = gen_closure(‘some_option’)
Pure LUA functions
Generic recommendations
• Use local whenever possible (otherwise, global
variables are expensive)
• Callbacks, closures and recursion are generally cheap
(when using LuaJIT)
• Do not mix string and number keys in tables, that
makes them hard to iterate
• ipairs and pairs are not equal
• Strings are constant in LUA
Regexp rules
Types
• Can work with the following elements:
• Headers: Message-Id=/^something$/H
• Mime parts: /some word/P
• Raw messages: /some pattern/M
• URLs: /example.com/U
• Some new flags are added:
• UTF8 flag: /u
Regexp rules
Generic information
• Can be combined using the following operators:
• AND: /something/P && Subject=/some/H
• OR: /something/P || Subject=/some/H
• NOT: !/something/P
• PLUS: /A/P + /B/P + /C/P >= 2
• Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS
• Braces can change priority: !A AND (B OR C)
Regexp rules
Performance considerations
• Avoid message regexps at any cost (use trie
instead)
• Regexp expressions are highly optimised in rspamd
and unnecessary evaluations are not performed
• UTF regexps are more expensive than default ones
(but could be useful sometimes)
• Always use the appropriate type of expression (e.g.
url for links and part for textual content)
Trie matching
• Perfect for fast raw message and text pattern
matching
• Scales almost linearly from input size (aho-corasic
algorithm)
• Can handle thousands and hundreds thousands
patterns (is a base for all antivirus scanners)
• Highly optimised for 64 bits systems
Questions?
Vsevolod Stakhov
https://rspamd.com

More Related Content

What's hot

Scapy TLS: A scriptable TLS 1.3 stack
Scapy TLS: A scriptable TLS 1.3 stackScapy TLS: A scriptable TLS 1.3 stack
Scapy TLS: A scriptable TLS 1.3 stackAlexandre Moneger
 
Donatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIsDonatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIsTanya Denisyuk
 
SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014Derek Collison
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017HBaseCon
 
HBaseCon2017 HBase at Xiaomi
HBaseCon2017 HBase at XiaomiHBaseCon2017 HBase at Xiaomi
HBaseCon2017 HBase at XiaomiHBaseCon
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormJohn Georgiadis
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with ErlangMaxim Kharchenko
 
Graylog2 (MongoBerlin/MongoHamburg 2010)
Graylog2 (MongoBerlin/MongoHamburg 2010)Graylog2 (MongoBerlin/MongoHamburg 2010)
Graylog2 (MongoBerlin/MongoHamburg 2010)lennartkoopmann
 
NPF scripting with Lua by Lourival Vieira Neto
NPF scripting with Lua by Lourival Vieira NetoNPF scripting with Lua by Lourival Vieira Neto
NPF scripting with Lua by Lourival Vieira Netoeurobsdcon
 
Introduction to OpenHFT for Melbourne Java Users Group
Introduction to OpenHFT for Melbourne Java Users GroupIntroduction to OpenHFT for Melbourne Java Users Group
Introduction to OpenHFT for Melbourne Java Users GroupPeter Lawrey
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
Advanced off heap ipc
Advanced off heap ipcAdvanced off heap ipc
Advanced off heap ipcPeter Lawrey
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in financePeter Lawrey
 
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPROIDEA
 
Profiling with Devel::NYTProf
Profiling with Devel::NYTProfProfiling with Devel::NYTProf
Profiling with Devel::NYTProfbobcatfish
 

What's hot (20)

Scapy TLS: A scriptable TLS 1.3 stack
Scapy TLS: A scriptable TLS 1.3 stackScapy TLS: A scriptable TLS 1.3 stack
Scapy TLS: A scriptable TLS 1.3 stack
 
Donatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIsDonatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIs
 
erlang 101
erlang 101erlang 101
erlang 101
 
SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer Optimization
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 
HBaseCon2017 HBase at Xiaomi
HBaseCon2017 HBase at XiaomiHBaseCon2017 HBase at Xiaomi
HBaseCon2017 HBase at Xiaomi
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with Erlang
 
Graylog2 (MongoBerlin/MongoHamburg 2010)
Graylog2 (MongoBerlin/MongoHamburg 2010)Graylog2 (MongoBerlin/MongoHamburg 2010)
Graylog2 (MongoBerlin/MongoHamburg 2010)
 
Nmap for Scriptors
Nmap for ScriptorsNmap for Scriptors
Nmap for Scriptors
 
NPF scripting with Lua by Lourival Vieira Neto
NPF scripting with Lua by Lourival Vieira NetoNPF scripting with Lua by Lourival Vieira Neto
NPF scripting with Lua by Lourival Vieira Neto
 
Introduction to OpenHFT for Melbourne Java Users Group
Introduction to OpenHFT for Melbourne Java Users GroupIntroduction to OpenHFT for Melbourne Java Users Group
Introduction to OpenHFT for Melbourne Java Users Group
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Advanced off heap ipc
Advanced off heap ipcAdvanced off heap ipc
Advanced off heap ipc
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in finance
 
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
 
Profiling with Devel::NYTProf
Profiling with Devel::NYTProfProfiling with Devel::NYTProf
Profiling with Devel::NYTProf
 

Viewers also liked

New solver for FreeBSD pkg
New solver for FreeBSD pkgNew solver for FreeBSD pkg
New solver for FreeBSD pkgVsevolod Stakhov
 
Pkg slides from BSDCan conference
Pkg slides from BSDCan conferencePkg slides from BSDCan conference
Pkg slides from BSDCan conferenceVsevolod Stakhov
 
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...Ontico
 
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲ChinaNetCloud
 

Viewers also liked (7)

ast-rspamd
ast-rspamdast-rspamd
ast-rspamd
 
Rspamd testing
Rspamd testingRspamd testing
Rspamd testing
 
Rspamd symbols
Rspamd symbolsRspamd symbols
Rspamd symbols
 
New solver for FreeBSD pkg
New solver for FreeBSD pkgNew solver for FreeBSD pkg
New solver for FreeBSD pkg
 
Pkg slides from BSDCan conference
Pkg slides from BSDCan conferencePkg slides from BSDCan conference
Pkg slides from BSDCan conference
 
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
 
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
 

Similar to rspamd-slides

Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?DoiT International
 
Ch 18: Source Code Auditing
Ch 18: Source Code AuditingCh 18: Source Code Auditing
Ch 18: Source Code AuditingSam Bowne
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedis Labs
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkTomas Doran
 
CNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code AuditingCNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code AuditingSam Bowne
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Fedora Developer's Conference 2014 Talk
Fedora Developer's Conference 2014 TalkFedora Developer's Conference 2014 Talk
Fedora Developer's Conference 2014 TalkRainer Gerhards
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisRicard Clau
 
London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
RSYSLOG v8 improvements and how to write plugins in any language.
RSYSLOG v8 improvements and how to write plugins in any language.RSYSLOG v8 improvements and how to write plugins in any language.
RSYSLOG v8 improvements and how to write plugins in any language.Rainer Gerhards
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...DataStax Academy
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 

Similar to rspamd-slides (20)

Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?
 
Ch 18: Source Code Auditing
Ch 18: Source Code AuditingCh 18: Source Code Auditing
Ch 18: Source Code Auditing
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Server Tips
Server TipsServer Tips
Server Tips
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
CNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code AuditingCNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code Auditing
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Fedora Developer's Conference 2014 Talk
Fedora Developer's Conference 2014 TalkFedora Developer's Conference 2014 Talk
Fedora Developer's Conference 2014 Talk
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with Redis
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
RSYSLOG v8 improvements and how to write plugins in any language.
RSYSLOG v8 improvements and how to write plugins in any language.RSYSLOG v8 improvements and how to write plugins in any language.
RSYSLOG v8 improvements and how to write plugins in any language.
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
 
Redis acc 2015_eng
Redis acc 2015_engRedis acc 2015_eng
Redis acc 2015_eng
 
Performance
PerformancePerformance
Performance
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Timesten Architecture
Timesten ArchitectureTimesten Architecture
Timesten Architecture
 

rspamd-slides

  • 3. Rspamd in nutshell • Uses multiple rules to evaluate messages scores • Is written in C • Uses event driven processing model • Supports plugins in LUA • Has self-contained management web interface
  • 4. Design goals • Orientation on the mass mail processing • Performance is the cornerstone of the whole project • State-of-art techniques to filter spam • Prefer dynamic filters (statistics, hashes, DNS lists and so on) to static ones (plain regexp)
  • 6. Event driven processing Never blocks* • Pros: ✅ Can process rules while waiting for network services ✅ Can send all network requests simultaneously ✅ Can handle multiple messages within the same process • Cons: 📛 Callbacks hell (hard development) ⛔ Hard to limit memory usage due to unlimited concurrency *almost all the time
  • 7. Sequential processing Traditional approach Rule 1 Rule 2 Rule 3 DNS Hashes Wait Wait Timeline
  • 8. Event driven model Rspamd approach Rule 1 Rule 2 Rule 3 DNS Hashes Wait Timeline
  • 9. Rule 1 Rule 1 Rules DNS Hashes Wait Event driven model What happens in the real life
  • 10. Event driven model Some measurements • Rspamd can send hundred thousands of DNS requests per second (RBL, URI blacklists, custom DNS lists): time: 5540.8ms real, 2427.4ms virtual, dns req: 120543 • For small messages (which are 99% of typical mail) network processing is hundreds times more expensive than direct processing: time: 996.140ms real, 22.000ms virtual, • Event model scales very well allowing highest possible concurrency level within a single process (no locking is needed normally)
  • 11. Real message processing We need to go deeper RulesRules Filters RulesRules Pre-filters RulesRules Post-filters 📩 Message 📬 Result Wait Wait Wait (dependencies)
  • 12. Real message processing We need to go deeper • Pre filters are used to evaluate message or to reject/accept it early (e.g. greylisting) • Normal rules add scores (positive or negative) • Post filters combine rules and adjust scores if needed (e.g. composite rules) • Normal rules can also depend on each other (additional waiting)
  • 13. Rspamd processes Overview Scanning processes Scanning processes Scaner processes Controller Service processes Main process ✉✉✉ HTTP 📬 ✉✉✉ Learn Messages Results
  • 14. Main process One to rule them all… • Reads configuration • Manages worker processes • Listens on sockets • Opens and reopen log files • Handles dead workers • Handles signals • Reloads configuration • Handle command line Main process ⬇ Process ProcessProcess Process 📃Logs Listen sockets 📝Config ⬇ ⬇ Signals
  • 15. Scanner process • Scans messages and returns result • Uses HTTP for operations • Reply format is JSON • Has SA compatibility protocol
  • 16. Controller worker • Provides data for web interface (acts as HTTP server for AJAX requests and serving static files) • Is used to learn statistics and fuzzy hashes • Has 3 levels of access: • Trusted IP addresses (both read and write) • Normal password* (read commands) • Enable password* (all commands) * Passwords are encouraged to be stored encrypted using slow hash function
  • 17. Service workers • Are used by rspamd internally and usually have no external API • The following types are defined: • Fuzzy storage — stores fuzzy hashes and is learned from the controller and accessed from scanners • Lua worker — LUA application server • SMTP proxy — SMTP balancing proxy with RBL filtering • HTTP proxy — balancing HTTP proxy with encryption support
  • 19. Statistics architecture Bayes operations • Uses sparsed 5-gramms • Uses messages’ metadata (User-Agent, some specific headers) • Uses inverse chi-square function to combine probabilities • Weights of the tokens are based on theirs positions
  • 20. Statistics benchmarks Hard cases (images spam) Spam trigger 5% 95% Spam symbol Not detected Ham trigger 8% 92% Ham symbol Not detected
  • 21. Statistics architecture Bayes tokenisation Quick brown fox jumps over lazy dog 1 2 3 4 1 2
  • 22. Statistics architecture Statistics architecture Classifier Tokeniser Statfile (class) Statfile (class) Statfile (class) Backend Classification Normalised words (utf8 + stemming) Spam probability Tokens Weights
  • 23. Fuzzy hashes Overview • Are used to match, not to classify a message • Combine exact hashes (e.g. for images or attachments) with shingles fuzzy match for text • Use sqlite3 for storage • Expire hashes slowly • Write to all storages, read from random one
  • 24. Fuzzy hashes Shingles algorithm Quick brown fox jumps over lazy dog w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 N hashes h1 h2 h3 h4 h1’ h2’ h3’ h4’
  • 25. Fuzzy hashes Shingles algorithm h1 h2 h3 h1’ h2’ h3’ h1’’ h2’’ h3’’ h1’’’’’ h2’’’’ h3’’’’ … … … … … min min min min … N shinglesN hash pipes
  • 26. Fuzzy hashes Shingles algorithm • Probabilistic algorithm (due to min hash) • Use sliding window for matching words • N siphash contexts with derived keys • Derive subkeys using blake2 function • Current settings: window size = 3, N = 32
  • 28. Overview • Rspamd is focused on performance • No unnecessary rules are executed • Memory is organised in memory pools • All performance critical tasks are done by specialised finite-state-machines • Approximate match is performed if possible
  • 29. Rules optimisation Global optimisations • Stop processing when rejection score is hit • Process negative rules first to avoid FP errors • Execute less expensive rules first: • Evaluate rules average execution time, score and frequency • Apply greedy algorithm to reorder • Resort periodically
  • 30. Rules optimisation Local optimisations • Each rule is additional optimised using abstract syntax tree (AST): 3-4 times speed up for large messages • Each rule is split and reordered using the similar greedy algorithm • Regular expressions are compiled using PCRE JIT (from 50% to 150% speed up usually) • Lua is optimised using LuaJIT
  • 31. AST optimisations Branches cut & |C ! B A A = 0, B = 1, C = 0 0 1 10 1 0 Eval order • 4/6 branches skipped
  • 32. AST optimisations N-ary optimisations > + 2 ! B A Eval order C D E What do we compare? Here is our limit Stop here
  • 33. Parsing FSM • For the most of time consuming operations, rspamd uses special finite-state machines: • headers parsing; • received headers parsing; • protocol parsing; • URI parsing; • HTML parsing • Prefer approximate matching, meaning extraction of the most important information and skipping less important details
  • 34. IP addresses storage Traditional radix trie IP1 IP2 01 01 01 01 Level per bit: 32 levels for IPv4 128 levels for IPv6
  • 35. IP addresses storage Prefix skipped radix trie 010 IP1 IP2 01 01
  • 36. IP addresses storage Prefix skipped radix trie • Can efficiently compress IP prefixes • Lookup is much faster due to lower trie depth • IPv4 and IPv6 addresses can live within a single trie • Insertion is also faster • Algorithm is much harder but extensively tested
  • 37. Library optimisations Logger interface • Universal logger for files/syslog/console • Filters non-ascii (or non-utf8 if enabled) symbols • Allows skipping of repeated messages • Can disable processing in case of throttling • Can handle both privileged and non-privileged reopening
  • 38. Library optimisations Printf interface • Libc printf is slow and stupid • Rspamd printf is inspired by nginx printf: • Supports fixed integers (int64_t, uint32_t) • Supports fixed length string (%v) • Supports encoded strings and numbers (human-readable, hex encoding, base64 and so on) • Supports various backends: fixed size buffers, automatically growing strings, files, console… • Rspamd printf does not try to print input when output is overflowed (so it’s impossible to force it to use CPU resources for ridiculously large strings)
  • 39. Library optimisations String operations • Fast base64/base32 operations: • alignment optimisations; • use loop unwinding; • use 64 bit integers instead of characters • Fast lowercase: • use the same optimisations for ASCII string • approximate lowercase for UTF8 (not 100% correct but much faster) • Fast lines counting: http://git.io/vYldq
  • 40. Library optimisations Generic tools • Fast hash functions (xxhash and blake2) • Fast encryption (using SIMD instructions if possible) • Use mmap when possible • Align memory for faster operations • Use google performance tools to find bottlenecks
  • 42. Main points • Maintaining secure coding is hard for C: • Prefer fixed length strings • Avoid insecure functions • Abort if malloc fails • Assertions on bad input • Testing (functional + unit testing) • Main treats: • Interaction with DNS • Passive snooping of traffic • Specially crafted messages
  • 43. DNS security • DNS is the major point to interact with the external world • There could be thousands requests per second • DNS replies can be untrusted • SPF records could be recursive • DKIM records could be malformed • Need local and global DNS requests limit
  • 44. RDNS library • Uses secure DNS ID generator based on crypto permutation and entropy reseeding • Uses sockets pool with time/usage expiration • Randomises source port • Carefully filters input data (+IDN encoding)
  • 45. Transport encryption • Designed to be fast, simple and secure • TLS is too hard to manage in events based model • Many functions of TLS are useless for rspamd • TLS involves intermediate copying and significant latency increase
  • 50. Performance analysis Why HTTPCrypt is fast • For new sessions, HTTPCrypt uses curve25519 ECDH which is almost twice faster than NIST P-256 ECDH • There is no signing operation and no ECDSA • For bulk encryption, there is no intermediate buffering like in TLS - the payload is encrypted in-place • Latency is reduced by skipping the full TLS handshake • Large requests are somehow slow due to lack of chunked encoding in HTTPCrypt implementation and some clever tricks of data reading
  • 51. Hashes security • Hash tables are vulnerable for untrusted data: • Rspamd randomly chooses hash tables seed at start that is hard to predict • XXHash is used for good speed and hash distribution • Siphash is used for public hash tables (e.g. fuzzy hashes) • It’s hard to predict hash seed, hence it’s hard to organise computational attack on hash tables
  • 53. Configuration evolution 1. Grammar parser (lex + yacc) ⛔ Hard to manage ⛔ Hard to extend 2. XML ⛔ Unreadable ⛔ Problems with expressions (A &gt; B) 3. UCL - universal configuration language ✅ Easy to manage (looks like nginx.conf) ✅ Macro support ✅ JSON data model (can be used as JSON parser)
  • 54. UCL building blocks • Sections • Arrays • Variables • Macros • Comments section { key = “value”; number = 10K; } upstreams = [ “localhost:80”, “example.com:8080”, ] static_dir = “${WWWDIR}/“; filepath = “${CURDIR}/data”; .include “${CONFDIR}/workers.conf” .include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf” .lua { print(“hey!”); } key = value; // Single line comment /* Multiline comment /* can also be nested */ */
  • 55. Configuration components Global options Workers configuration Scores (metrics) Modules configuration Statistics • Each component is normally included to the main configuration • rspamd.local.conf is used to extend configuration • rspamd.override.conf is used to override values in the configuration • It is possible to use numeric multipliers: “k/m/g” or “ms/s/m/h/d” for time values
  • 56. Lua rules • The most of rules are defined in LUA configuration • Two types of LUA rules: • Regexp rules (look like strings) • Lua functions (pure LUA code)
  • 57. Lua rules Some examples -- Outlook versions that should be excluded from summary rule local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H' local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H' -- Summary rule for forged outlook reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s', forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid) • Regexp rule • Lua rule rspamd_config.R_EMPTY_IMAGE = function(task) local tp = task:get_text_parts() -- get text parts in a message for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs` if p:is_html() then -- if the current part is html part local hc = p:get_html() -- we get HTML context local len = p:get_length() -- and part's length if len < 50 then -- if we have a part that has less than 50 bytes of text local images = hc:get_images() -- then we check for HTML images if images then -- if there are images for _,i in ipairs(images) do -- then iterate over images in the part if i['height'] + i['width'] >= 400 then -- if we have a large image return true -- add symbol end end end end end end end
  • 58. Pure LUA functions Review • Are very powerful • Have access to all information from rspamd via lua API: https://rspamd.com/doc/lua/ • Are very fast since C <-> LUA interaction is cheap • Can use zero-copy objects called rspamd{text} to avoid copying when moving data between C and LUA
  • 59. Pure LUA functions • Variables: • Conditionals: • Loops: • Tables: • Functions: • Closures: local ret = false -- Generic variable local rules = {} -- Empty table local rspamd_logger = require “rspamd_logger" -- Load rspamd module if not ret then -- can use ‘not’, ‘and’, ‘or’ here … elseif ret ~= 10 then -- note ~= for ‘not equal’ operator end for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value for i=1,10 do … end -- Count from 1 to 10 local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1 another_key = function(task) … end, -- Functions can be values [2] = {} -- Other tables can be values } -- Can have both numbers and strings as key and anything as values local function something(task) -- Normal definition local cb = function(data) -- Functions can be nested … end end local function gen_closure(option) local ret = false -- Local variable return function(task) task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here end end rspamd_config.SYMBOL = gen_closure(‘some_option’)
  • 60. Pure LUA functions Generic recommendations • Use local whenever possible (otherwise, global variables are expensive) • Callbacks, closures and recursion are generally cheap (when using LuaJIT) • Do not mix string and number keys in tables, that makes them hard to iterate • ipairs and pairs are not equal • Strings are constant in LUA
  • 61. Regexp rules Types • Can work with the following elements: • Headers: Message-Id=/^something$/H • Mime parts: /some word/P • Raw messages: /some pattern/M • URLs: /example.com/U • Some new flags are added: • UTF8 flag: /u
  • 62. Regexp rules Generic information • Can be combined using the following operators: • AND: /something/P && Subject=/some/H • OR: /something/P || Subject=/some/H • NOT: !/something/P • PLUS: /A/P + /B/P + /C/P >= 2 • Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS • Braces can change priority: !A AND (B OR C)
  • 63. Regexp rules Performance considerations • Avoid message regexps at any cost (use trie instead) • Regexp expressions are highly optimised in rspamd and unnecessary evaluations are not performed • UTF regexps are more expensive than default ones (but could be useful sometimes) • Always use the appropriate type of expression (e.g. url for links and part for textual content)
  • 64. Trie matching • Perfect for fast raw message and text pattern matching • Scales almost linearly from input size (aho-corasic algorithm) • Can handle thousands and hundreds thousands patterns (is a base for all antivirus scanners) • Highly optimised for 64 bits systems