SlideShare a Scribd company logo
Part I: Introduction
Why Rspamd?
SpamAssassin → rspamd in “Rambler” Mail
Rspamd in action
Real-time stats
Rspamd in action
Web UI
Rspamd in a nutshell
• Written in plain C
• Uses non-blocking (event-driven) data processing
model
• Oriented on performance
• Plugins and rules are written in Lua
• Supports all major filtering methods
• Packaged for the most of *nix systems
Spam filtering
Major spam classes
• Fraud and scam:
• “Nigerian” mail
• Phishing (banks, social networks)
• Advertisement:
• Plain text
• Pictures, obfuscated HTML etc
• Spam from the social networks notifications
Scam:a typical example
Advertisement: images
Advertisement: images
Filtering methods
Basic principles
• Policies (particularly efficient against fraud)
• DNS lists (for IP and URLs)
• SPF, DKIM, DMARC
• Sender’s reputation
• Static content filtering (patterns match)
• Statistics and ML (e.g. Bayesian)
Filtering methods
Difficulties
• Policies:
• Require network requests with unpredictable delay
• False positive (e.g. when forwarding)
• Static Content filtering:
• CPU intensive
• Unlikely to be useful long-term
• Statistics:
• Not very precise
• Hard to adopt for multi-user systems
Rspamd special features
• Text lemmatisation and normalisation (snowball stemmer)
• Rspamd converts all texts to utf-8
• Support for users settings in JSON
• Possibility to have multiple metrics for a message
• Web interface
• Rspamd can load SpamAssassin rules directly
• OpenSource (BSD 2-clause)
Rspamd integration
Using Rmilter
MTA Rmilter Rspamd
Storage
Message and SMTP DATA HTTP request
Redis cache
JSON: action, symbolsDelay, modify or reject
Greylisting, limits Statistics, shared cache
✉✉✉✉
✉ ✉
🚮
Inbox Spam
✉
Reject
✉
Rspamd integration
milter
• rmilter + postfix/sendmail:
• supports all rspamd features (e.g. greylisting
action)
• can split inbound and outbound messages
• can whitelist reply to own messages
• can add DKIM signatures to outgoing email
Rspamd integration
Other MTA
• exim - supports rspamd internally (from 4.86)
• haraka - supports rspamd internally
• any other MTA - LDA mode when rspamc checks
message on rspamd and calls delivery agent with
altered message using Unix pipes
Part II: Architecture
Development principles
• Performance and scalability
• Filtering quality and extensions
• Security
• Simplicity and documentation coverage
Rules in Rspamd
Pre-filters Filters Post-filters
Rule Rule Rule Rule Rule Rule Rule Rule Rule Rule
Policy rules Content rules Statistics rules
Settings, ratelimit Main scan Postprocessing
Message lifecycle
RulesRules
Filters
RulesRules
Pre-filters
RulesRules
Post-filters
📩 Message
📬 Result
Wait
Wait
Wait for dependencies
Event driven model
Never blocks*
• Advantages:
✅ Can process network replies asynchronously
✅ Sends all network request simultaneously
✅ Can process many messages in parallel, scales well
• Disadvantages:
📛 Callbacks hell
⛔ Memory consumption can go wild
*almost never
Sequential processing
Traditional approach
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait Wait
Processing
Event-driven processing
Rule 1 Rule 2 Rule 3
DNS Hashes
Wait
Processing
Rule 1
Rule 1
Rules
DNS Hashes
Wait
Event-driven processing
What really happens?
Event-driven processing
• Rspamd can send thousands of DNS requests: 

time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
Event-driven processing
• Rspamd can send thousands of DNS requests: 

time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
• For short messages waiting prevails on the
processing time: 

time: 996.140ms real, 22.000ms virtual,
Statistics in rspamd
• Uses Markov chains for statistic tokens (using
sparse bi-gramms model)
• Considers metadata in statistics: mime structure,
headers and so on
• Statistics could be stored in redis or sqlite3
• Rspamd supports per-user, per-domain statistics,
multiple classifiers and autolearning
Tokens generation
Quick brown fox jumps over lazy dog
1
2
3
4
1
2
Statistics example
Hard case: images spam
0%
25%
50%
75%
100%
Spam Ham
8%5%
92%95%
Correctly recognized Unrecognized
Fuzzy hashes
Introduction
• Can find similar or close to similar messages
• Rspamd uses fuzzy logic for text and exact
hashes for attachments
• Storage engine - sqlite3
Shingles Algorithm
Quick brown fox jumps over lazy dog
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
w1 w2 w3
w2 w3 w4
w3 w4 w5
w4 w5 w6
N hashes
h1
h2
h3
h4
h1’
h2’
h3’
h4’
Shingles Algorithm
h1 h2 h3
h1’ h2’ h3’
h1’’ h2’’ h3’’
h1’’’’’ h2’’’’ h3’’’’
…
…
…
…
…
min
min
min
min
…
N shinglesN hash pipes
Shingles Algorithm
• Probabilistic in general
• Uses siphash with different keys
• Window size - 3 words, number of hashes - 32
Part III: Performance
Main Principles
• Always consider performance.
• If something could be skipped, skip it
• Use more specific solution if it is faster than
generic one (special FSM, asm fragments)
• If it is possible to do something fast but not 100%
accurate, try faster method
Rules Optimisation
Global optimisation
• Stop checks if a message has reached threshold
• Prioritise rules with negative weight
• Prioritise faster and more frequent rules:
• Check weight, hit probability and average time
• Use greedy sorting algorithm
Rules Optimisation
Local optimisations
• Build Abstract Syntax Tree for each rule
• Tree is periodically resorted in respect of hit
probability and execution time
• Regular expressions are optimised using PCRE
+JIT and Hyperscan (from 1.1)
• Lua code is optimised by means of LuaJIT
AST Optimisations
Branches cut
&
|C
! B
A
A = 0, B = 1, C = 0
0
1
10
1
0
Order of processing
4/6 branches skipped
(!A or B) and C
AST optimisations
N-ary operations
>
+ 2
! B
A
Check order
C D E
What we compare?
Here is our limit
Stop processing
(!A+B+C+D+E) > 2
C Library optimisations
• Accelerated base32/base64
• Optimised string functions
• Fixed size strings
• Own printf implementation
• Own logger layer
• Assembly versions for crypto ciphers and hashes
Hyperscan in Rspamd
Multi-regexp match engine
Hyperscan in Rspamd
Problem statement
• Need to match many regular expressions for the
same data:
len: 610591, time: 2492.457ms real, 882.251ms virtual
regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre
• No need to capture patterns
• Need to match both UTF8 and raw data
• Need to support Perl compatible syntax
Naive solution
• Join all expressions with ‘|’:
/abc/ + /cde/ => /(?:abc)|(?:cde)/
• Won’t work:
/abc/ + /a/ => /(?:abc)|(?:a)/ -> the second rule won’t match for ‘abc’
Hyperscan
Text
/re1/
/re2/
/re3/
/re4/
/re5/
/re6/
/re7/
Scan
Match
Match
Match
Rules and Hyperscan
From
Subject
Text Part
HTML Part
Attachment
Header RE class
Header RE class
Text Parts class
Raw Body class
Whole message
m1
m2
m3
m4
m5
m6
MIME message Hyperscan regexps
Problems with Hyperscan
• Slow compile time (like 100 times slower than
PCRE+JIT)
• Many PCRE features are unsupported (lookahead,
look-behind)
• Is not packaged for the major OSes (requires
relatively fresh boost)
Lazy hyperscan compile
HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Use PCRE only
Compile to disk
Lazy hyperscan compile
HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Use PCRE only
All done Broadcast message
Lazy hyperscan compile
HS compiler
Main
Scanners
Compiled
class
Compiled
class
Compiled
class
Switch to hyperscan
Loaded from cache
Wait for ‘recompile’ command
Lazy hyperscan compile
• Does not affect the starting time
• If nothing changes then scanners load pre-
compiled expressions on startup (using crypto
hashes)
• Recompile command is used to force
recompilation process:
rspamadm recompile
Unsupported features
Pre-filter mode
Text
/re1/pre
/re2/
/re3/pre
/re4/
/re5/
/re6/
/re7/
Scan
Maybe match?
Match
Match
Check
PCRE
Match!
Replaces unsupported constructs with "weaker"
supported ones, guaranteeing a match superset:
Hyperscan results
Use Hyperscan for the vast majority of RE rules:
len: 610591, time: 2492.457ms real, 882.251ms virtual
regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre
len: 610591, time: 654.596ms real, 309.785ms virtual
regexp statistics: 34 pcre regexps scanned, 41 regexps matched, 8.41M bytes scanned using pcre,
9.56M bytes scanned total
Part IV: Security
Main Principles
• Use secure C programming guidelines
• Perform static analysis for the code including
special clang plugin for own printf format safety
• Deal with malicious messages
• Limit and protect DNS subsystem
• Encrypt and protect all data passed over the
network
Examples
• An email can contain thousands of objects to
check (e.g. URLs)
• SPF records could have infinite recursion
• Need limits for network checks and memoization
of the results
DNS protection
• Cryptographic DNS id generation
• Socket pool with constant rotation
• Source port randomisation
• Input filtering (+ IDN encoding)
Traffic Encryption
• Security -> performance -> simplicity
• TLS is hard in non-blocking IO
• TLS does a lot of unnecessary (in case of
rspamd) operations increasing thus CPU load and
latency
HTTPCrypt Protocol
Connection Establishment
ClientHelo
ServerHelo
Certificates⚠
ClientChangeCipher
ServerChangeCipher
Data
TLS handshake
ClientPK
Encrypted request
Encrypted reply
Validation (optional)
HTTPCrypt handshake
Throughput
Cached connections
Throughput
New connections
Throughput
New connections
Connection Latency
Cached connections
Connection Latency
New connections
Why HTTPCrypt is fast
• Faster key exchange (ECDH)
• Server does not need to sign anything

(needed for replay protection in TLS)
• Encryption is done in-place
• No replay protection for the first message - still
suitable for idempotent scanning
Conclusions
Conclusions
• Rspamd is designed to work as fast as possible
• Rspamd has a rich and actively maintained
collection of rules and plugins and can re-use
existing SA rules
• It is possible to encrypt all data transferred over
the network in rspamd ecosystem
Open issues
• No Debian maintainer (any volunteers?)
• No dynamic rules updates (planned for 1.2) yet
• Rapid development (hard to keep compatibility)
Questions?
Vsevolod Stakhov
https://rspamd.com
Appendix
Rspamd processes
✉
Scanning processes
Scanning processes
Scanner processes Controller Service processes
Main process
✉✉
HTTP
📬 ✉✉✉
Learn
Messages Results
Configuration evolution
1. Grammar parser (lex + yacc)
⛔ Hard to manage
⛔ Hard to extend
2. XML
⛔ Unreadable
⛔ Problems with expressions (A > B)
3. UCL - universal configuration language
✅ Easy to manage (looks like nginx.conf)
✅ Macro support
✅ JSON data model (can be used as JSON parser)
UCL building blocks
• Sections
• Arrays
• Variables
• Macros
• Comments
section {
key = “value”;
number = 10K;
}
upstreams = [
“localhost:80”,
“example.com:8080”,
]
static_dir = “${WWWDIR}/“;
filepath = “${CURDIR}/data”;
.include “${CONFDIR}/workers.conf”
.include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf”
.lua { print(“hey!”); }
key = value; // Single line comment
/* Multiline comment
/* can also be nested */
*/
Configuration components
Global options
Workers configuration
Scores (metrics)
Modules configuration
Statistics
• Each component is normally included
to the main configuration
• rspamd.local.conf is used to extend
configuration
• rspamd.override.conf is used to
override values in the configuration
• It is possible to use numeric
multipliers: “k/m/g” or “ms/s/m/h/d” for
time values
Lua rules
• Most of rules are defined in Lua configuration
• Two types of Lua rules:
• Regexp rules (look like strings)
• Lua functions (pure Lua code)
Lua rules
Some examples
-- Outlook versions that should be excluded from summary rule
local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H'
local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H'
-- Summary rule for forged outlook
reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s',
forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid)
• Regexp rule
• Lua rule
rspamd_config.R_EMPTY_IMAGE = function(task)
local tp = task:get_text_parts() -- get text parts in a message
for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
if p:is_html() then -- if the current part is html part
local hc = p:get_html() -- we get HTML context
local len = p:get_length() -- and part's length
if len < 50 then -- if we have a part that has less than 50 bytes of text
local images = hc:get_images() -- then we check for HTML images
if images then -- if there are images
for _,i in ipairs(images) do -- then iterate over images in the part
if i['height'] + i['width'] >= 400 then -- if we have a large image
return true -- add symbol
end
end
end
end
end
end
end
Pure LUA functions
Review
• Are very powerful
• Have access to all information from rspamd via lua
API: https://rspamd.com/doc/lua/
• Are very fast since C <-> LUA interaction is cheap
• Can use zero-copy objects called rspamd{text} to
avoid copying when moving data between C and
LUA
Pure LUA functions
• Variables:
• Conditionals:
• Loops:
• Tables:
• Functions:
• Closures:
local ret = false -- Generic variable
local rules = {} -- Empty table
local rspamd_logger = require “rspamd_logger" -- Load rspamd module
if not ret then -- can use ‘not’, ‘and’, ‘or’ here
…
elseif ret ~= 10 then -- note ~= for ‘not equal’ operator
end
for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value
for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value
for i=1,10 do … end -- Count from 1 to 10
local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1
another_key = function(task) … end, -- Functions can be values
[2] = {} -- Other tables can be values
} -- Can have both numbers and strings as key and anything as values
local function something(task) -- Normal definition
local cb = function(data) -- Functions can be nested
…
end
end
local function gen_closure(option)
local ret = false -- Local variable
return function(task)
task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here
end
end
rspamd_config.SYMBOL = gen_closure(‘some_option’)
Pure LUA functions
Generic recommendations
• Use local whenever possible (otherwise, global
variables are expensive)
• Callbacks, closures and recursion are generally cheap
(when using LuaJIT)
• Do not mix string and number keys in tables, that
makes them hard to iterate
• ipairs and pairs are not equal
• Strings are constant in LUA
Regexp rules
Types
• Can work with the following elements:
• Headers: Message-Id=/^something$/H
• Mime parts: /some word/P
• Raw messages: /some pattern/M
• URLs: /example.com/U
• Some new flags are added:
• UTF8 flag: /u
Regexp rules
Generic information
• Can be combined using the following operators:
• AND: /something/P && Subject=/some/H
• OR: /something/P || Subject=/some/H
• NOT: !/something/P
• PLUS: /A/P + /B/P + /C/P >= 2
• Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS
• Braces can change priority: !A AND (B OR C)
Regexp rules
Performance considerations
• Avoid message regexps at any cost (use trie
instead)
• Regexp expressions are highly optimised in rspamd
and unnecessary evaluations are not performed
• UTF regexps are more expensive than default ones
(but could be useful sometimes)
• Always use the appropriate type of expression (e.g.
url for links and part for textual content)
Trie matching
• Perfect for fast raw message and text pattern
matching
• Scales almost linearly with input size (aho-corasic
algorithm)
• Can handle thousands and hundreds of thousands
of patterns (is a base for all antivirus scanners)
• Highly optimised for 64 bit systems

More Related Content

What's hot

MongoDB and Ruby on Rails
MongoDB and Ruby on RailsMongoDB and Ruby on Rails
MongoDB and Ruby on Rails
rfischer20
 
Point field types in Solr. Evolution of the Range Queries.
Point field types in Solr. Evolution of the Range Queries.Point field types in Solr. Evolution of the Range Queries.
Point field types in Solr. Evolution of the Range Queries.
Andrey Mikryukov
 
Rspamd testing
Rspamd testingRspamd testing
Rspamd testing
Vsevolod Stakhov
 
RTree Spatial Indexing with MongoDB - MongoDC
RTree Spatial Indexing with MongoDB - MongoDC RTree Spatial Indexing with MongoDB - MongoDC
RTree Spatial Indexing with MongoDB - MongoDC
Nicholas Knize, Ph.D., GISP
 
New PHP Exploitation Techniques
New PHP Exploitation TechniquesNew PHP Exploitation Techniques
New PHP Exploitation Techniques
RIPS Technologies GmbH
 
Intro ProxySQL
Intro ProxySQLIntro ProxySQL
Intro ProxySQL
I Goo Lee
 
The Complete MariaDB Server tutorial
The Complete MariaDB Server tutorialThe Complete MariaDB Server tutorial
The Complete MariaDB Server tutorial
Colin Charles
 
Backup and-recovery2
Backup and-recovery2Backup and-recovery2
Backup and-recovery2
Command Prompt., Inc
 
DVGA writeup
DVGA writeupDVGA writeup
DVGA writeup
Yu Iwama
 
RAT - Repurposing Adversarial Tradecraft
RAT - Repurposing Adversarial TradecraftRAT - Repurposing Adversarial Tradecraft
RAT - Repurposing Adversarial Tradecraft
⭕Alexander Rymdeko-Harvey
 
Crash dump analysis - experience sharing
Crash dump analysis - experience sharingCrash dump analysis - experience sharing
Crash dump analysis - experience sharing
James Hsieh
 
Cryptography its history application and beyond
Cryptography its history application and beyondCryptography its history application and beyond
Cryptography its history application and beyond
kinleay
 
Webscripts
WebscriptsWebscripts
Webscripts
Alfresco Software
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
Dvir Volk
 
Passwords#14 - mimikatz
Passwords#14 - mimikatzPasswords#14 - mimikatz
Passwords#14 - mimikatz
Benjamin Delpy
 
Evolution of MySQL Parallel Replication
Evolution of MySQL Parallel Replication Evolution of MySQL Parallel Replication
Evolution of MySQL Parallel Replication
Mydbops
 
Built in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat GulecBuilt in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat Gulec
FIRAT GULEC
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete Tutorial
Colin Charles
 
PSConfEU - Offensive Active Directory (With PowerShell!)
PSConfEU - Offensive Active Directory (With PowerShell!)PSConfEU - Offensive Active Directory (With PowerShell!)
PSConfEU - Offensive Active Directory (With PowerShell!)
Will Schroeder
 
Idea alogorithim
Idea alogorithimIdea alogorithim
Idea alogorithim
SachinMaithani1
 

What's hot (20)

MongoDB and Ruby on Rails
MongoDB and Ruby on RailsMongoDB and Ruby on Rails
MongoDB and Ruby on Rails
 
Point field types in Solr. Evolution of the Range Queries.
Point field types in Solr. Evolution of the Range Queries.Point field types in Solr. Evolution of the Range Queries.
Point field types in Solr. Evolution of the Range Queries.
 
Rspamd testing
Rspamd testingRspamd testing
Rspamd testing
 
RTree Spatial Indexing with MongoDB - MongoDC
RTree Spatial Indexing with MongoDB - MongoDC RTree Spatial Indexing with MongoDB - MongoDC
RTree Spatial Indexing with MongoDB - MongoDC
 
New PHP Exploitation Techniques
New PHP Exploitation TechniquesNew PHP Exploitation Techniques
New PHP Exploitation Techniques
 
Intro ProxySQL
Intro ProxySQLIntro ProxySQL
Intro ProxySQL
 
The Complete MariaDB Server tutorial
The Complete MariaDB Server tutorialThe Complete MariaDB Server tutorial
The Complete MariaDB Server tutorial
 
Backup and-recovery2
Backup and-recovery2Backup and-recovery2
Backup and-recovery2
 
DVGA writeup
DVGA writeupDVGA writeup
DVGA writeup
 
RAT - Repurposing Adversarial Tradecraft
RAT - Repurposing Adversarial TradecraftRAT - Repurposing Adversarial Tradecraft
RAT - Repurposing Adversarial Tradecraft
 
Crash dump analysis - experience sharing
Crash dump analysis - experience sharingCrash dump analysis - experience sharing
Crash dump analysis - experience sharing
 
Cryptography its history application and beyond
Cryptography its history application and beyondCryptography its history application and beyond
Cryptography its history application and beyond
 
Webscripts
WebscriptsWebscripts
Webscripts
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
Passwords#14 - mimikatz
Passwords#14 - mimikatzPasswords#14 - mimikatz
Passwords#14 - mimikatz
 
Evolution of MySQL Parallel Replication
Evolution of MySQL Parallel Replication Evolution of MySQL Parallel Replication
Evolution of MySQL Parallel Replication
 
Built in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat GulecBuilt in physical and logical replication in postgresql-Firat Gulec
Built in physical and logical replication in postgresql-Firat Gulec
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete Tutorial
 
PSConfEU - Offensive Active Directory (With PowerShell!)
PSConfEU - Offensive Active Directory (With PowerShell!)PSConfEU - Offensive Active Directory (With PowerShell!)
PSConfEU - Offensive Active Directory (With PowerShell!)
 
Idea alogorithim
Idea alogorithimIdea alogorithim
Idea alogorithim
 

Viewers also liked

rspamd-hyperscan
rspamd-hyperscanrspamd-hyperscan
rspamd-hyperscan
Vsevolod Stakhov
 
New solver for FreeBSD pkg
New solver for FreeBSD pkgNew solver for FreeBSD pkg
New solver for FreeBSD pkg
Vsevolod Stakhov
 
ast-rspamd
ast-rspamdast-rspamd
ast-rspamd
Vsevolod Stakhov
 
rspamd-slides
rspamd-slidesrspamd-slides
rspamd-slides
Vsevolod Stakhov
 
Dovecot & Postfix バージョンアップ動向 201401-201505
Dovecot & Postfix バージョンアップ動向 201401-201505Dovecot & Postfix バージョンアップ動向 201401-201505
Dovecot & Postfix バージョンアップ動向 201401-201505
Narimichi Takamura
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
Vsevolod Stakhov
 
Pkg slides from BSDCan conference
Pkg slides from BSDCan conferencePkg slides from BSDCan conference
Pkg slides from BSDCan conference
Vsevolod Stakhov
 
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Ontico
 
Postfix
PostfixPostfix
Linux Based Mail Server
Linux Based Mail ServerLinux Based Mail Server
Linux Based Mail Server
Bernics Gábor
 
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
ChinaNetCloud
 

Viewers also liked (11)

rspamd-hyperscan
rspamd-hyperscanrspamd-hyperscan
rspamd-hyperscan
 
New solver for FreeBSD pkg
New solver for FreeBSD pkgNew solver for FreeBSD pkg
New solver for FreeBSD pkg
 
ast-rspamd
ast-rspamdast-rspamd
ast-rspamd
 
rspamd-slides
rspamd-slidesrspamd-slides
rspamd-slides
 
Dovecot & Postfix バージョンアップ動向 201401-201505
Dovecot & Postfix バージョンアップ動向 201401-201505Dovecot & Postfix バージョンアップ動向 201401-201505
Dovecot & Postfix バージョンアップ動向 201401-201505
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
Pkg slides from BSDCan conference
Pkg slides from BSDCan conferencePkg slides from BSDCan conference
Pkg slides from BSDCan conference
 
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
Практика совместного использования Lua и C в opensource спам-фильтре Rspamd /...
 
Postfix
PostfixPostfix
Postfix
 
Linux Based Mail Server
Linux Based Mail ServerLinux Based Mail Server
Linux Based Mail Server
 
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
 

Similar to rspamd-fosdem

Encryption in php
Encryption in phpEncryption in php
Encryption in php
sana mateen
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Anand Narayanan
 
Encryption pres
Encryption presEncryption pres
Encryption pres
MereySovet
 
CNIT 50: 6. Command Line Packet Analysis Tools
CNIT 50: 6. Command Line Packet Analysis ToolsCNIT 50: 6. Command Line Packet Analysis Tools
CNIT 50: 6. Command Line Packet Analysis Tools
Sam Bowne
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
SignalFx
 
Hyperscan - Mohammad Abdul Awal
Hyperscan - Mohammad Abdul AwalHyperscan - Mohammad Abdul Awal
Hyperscan - Mohammad Abdul Awal
harryvanhaaren
 
Ch 18: Source Code Auditing
Ch 18: Source Code AuditingCh 18: Source Code Auditing
Ch 18: Source Code Auditing
Sam Bowne
 
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
Alexandre Moneger
 
Safely Protect PostgreSQL Passwords - Tell Others to SCRAM
Safely Protect PostgreSQL Passwords - Tell Others to SCRAMSafely Protect PostgreSQL Passwords - Tell Others to SCRAM
Safely Protect PostgreSQL Passwords - Tell Others to SCRAM
Jonathan Katz
 
Debugging applications with network security tools
Debugging applications with network security toolsDebugging applications with network security tools
Debugging applications with network security tools
ConFoo
 
Basic cryptography
Basic cryptographyBasic cryptography
Basic cryptography
Perfect Training Center
 
CNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code AuditingCNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code Auditing
Sam Bowne
 
Coporate Espionage
Coporate EspionageCoporate Espionage
Coporate Espionage
UTD Computer Security Group
 
Packet Analysis - Course Technology Computing Conference
Packet Analysis - Course Technology Computing ConferencePacket Analysis - Course Technology Computing Conference
Packet Analysis - Course Technology Computing Conference
Cengage Learning
 
Follow the White Rabbit - Message Queues with PHP
Follow the White Rabbit - Message Queues with PHPFollow the White Rabbit - Message Queues with PHP
Follow the White Rabbit - Message Queues with PHP
Eric Rodriguez (Hiring in Lex)
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
Redge Technologies
 
Cryptography
CryptographyCryptography
Cassandra
CassandraCassandra
Cassandra
exsuns
 
Donatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIsDonatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIs
Tanya Denisyuk
 
Server Tips
Server TipsServer Tips
Server Tips
liqingfang126
 

Similar to rspamd-fosdem (20)

Encryption in php
Encryption in phpEncryption in php
Encryption in php
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
Encryption pres
Encryption presEncryption pres
Encryption pres
 
CNIT 50: 6. Command Line Packet Analysis Tools
CNIT 50: 6. Command Line Packet Analysis ToolsCNIT 50: 6. Command Line Packet Analysis Tools
CNIT 50: 6. Command Line Packet Analysis Tools
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Hyperscan - Mohammad Abdul Awal
Hyperscan - Mohammad Abdul AwalHyperscan - Mohammad Abdul Awal
Hyperscan - Mohammad Abdul Awal
 
Ch 18: Source Code Auditing
Ch 18: Source Code AuditingCh 18: Source Code Auditing
Ch 18: Source Code Auditing
 
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
 
Safely Protect PostgreSQL Passwords - Tell Others to SCRAM
Safely Protect PostgreSQL Passwords - Tell Others to SCRAMSafely Protect PostgreSQL Passwords - Tell Others to SCRAM
Safely Protect PostgreSQL Passwords - Tell Others to SCRAM
 
Debugging applications with network security tools
Debugging applications with network security toolsDebugging applications with network security tools
Debugging applications with network security tools
 
Basic cryptography
Basic cryptographyBasic cryptography
Basic cryptography
 
CNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code AuditingCNIT 127: Ch 18: Source Code Auditing
CNIT 127: Ch 18: Source Code Auditing
 
Coporate Espionage
Coporate EspionageCoporate Espionage
Coporate Espionage
 
Packet Analysis - Course Technology Computing Conference
Packet Analysis - Course Technology Computing ConferencePacket Analysis - Course Technology Computing Conference
Packet Analysis - Course Technology Computing Conference
 
Follow the White Rabbit - Message Queues with PHP
Follow the White Rabbit - Message Queues with PHPFollow the White Rabbit - Message Queues with PHP
Follow the White Rabbit - Message Queues with PHP
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Cryptography
CryptographyCryptography
Cryptography
 
Cassandra
CassandraCassandra
Cassandra
 
Donatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIsDonatas Mažionis, Building low latency web APIs
Donatas Mažionis, Building low latency web APIs
 
Server Tips
Server TipsServer Tips
Server Tips
 

rspamd-fosdem

  • 1.
  • 3. Why Rspamd? SpamAssassin → rspamd in “Rambler” Mail
  • 6. Rspamd in a nutshell • Written in plain C • Uses non-blocking (event-driven) data processing model • Oriented on performance • Plugins and rules are written in Lua • Supports all major filtering methods • Packaged for the most of *nix systems
  • 8. Major spam classes • Fraud and scam: • “Nigerian” mail • Phishing (banks, social networks) • Advertisement: • Plain text • Pictures, obfuscated HTML etc • Spam from the social networks notifications
  • 12. Filtering methods Basic principles • Policies (particularly efficient against fraud) • DNS lists (for IP and URLs) • SPF, DKIM, DMARC • Sender’s reputation • Static content filtering (patterns match) • Statistics and ML (e.g. Bayesian)
  • 13. Filtering methods Difficulties • Policies: • Require network requests with unpredictable delay • False positive (e.g. when forwarding) • Static Content filtering: • CPU intensive • Unlikely to be useful long-term • Statistics: • Not very precise • Hard to adopt for multi-user systems
  • 14. Rspamd special features • Text lemmatisation and normalisation (snowball stemmer) • Rspamd converts all texts to utf-8 • Support for users settings in JSON • Possibility to have multiple metrics for a message • Web interface • Rspamd can load SpamAssassin rules directly • OpenSource (BSD 2-clause)
  • 15. Rspamd integration Using Rmilter MTA Rmilter Rspamd Storage Message and SMTP DATA HTTP request Redis cache JSON: action, symbolsDelay, modify or reject Greylisting, limits Statistics, shared cache ✉✉✉✉ ✉ ✉ 🚮 Inbox Spam ✉ Reject ✉
  • 16. Rspamd integration milter • rmilter + postfix/sendmail: • supports all rspamd features (e.g. greylisting action) • can split inbound and outbound messages • can whitelist reply to own messages • can add DKIM signatures to outgoing email
  • 17. Rspamd integration Other MTA • exim - supports rspamd internally (from 4.86) • haraka - supports rspamd internally • any other MTA - LDA mode when rspamc checks message on rspamd and calls delivery agent with altered message using Unix pipes
  • 19. Development principles • Performance and scalability • Filtering quality and extensions • Security • Simplicity and documentation coverage
  • 20. Rules in Rspamd Pre-filters Filters Post-filters Rule Rule Rule Rule Rule Rule Rule Rule Rule Rule Policy rules Content rules Statistics rules Settings, ratelimit Main scan Postprocessing
  • 22. Event driven model Never blocks* • Advantages: ✅ Can process network replies asynchronously ✅ Sends all network request simultaneously ✅ Can process many messages in parallel, scales well • Disadvantages: 📛 Callbacks hell ⛔ Memory consumption can go wild *almost never
  • 23. Sequential processing Traditional approach Rule 1 Rule 2 Rule 3 DNS Hashes Wait Wait Processing
  • 24. Event-driven processing Rule 1 Rule 2 Rule 3 DNS Hashes Wait Processing
  • 25. Rule 1 Rule 1 Rules DNS Hashes Wait Event-driven processing What really happens?
  • 26. Event-driven processing • Rspamd can send thousands of DNS requests: 
 time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
  • 27. Event-driven processing • Rspamd can send thousands of DNS requests: 
 time: 5540.8ms real, 2427.4ms virtual, dns req: 120543 • For short messages waiting prevails on the processing time: 
 time: 996.140ms real, 22.000ms virtual,
  • 28. Statistics in rspamd • Uses Markov chains for statistic tokens (using sparse bi-gramms model) • Considers metadata in statistics: mime structure, headers and so on • Statistics could be stored in redis or sqlite3 • Rspamd supports per-user, per-domain statistics, multiple classifiers and autolearning
  • 29. Tokens generation Quick brown fox jumps over lazy dog 1 2 3 4 1 2
  • 30. Statistics example Hard case: images spam 0% 25% 50% 75% 100% Spam Ham 8%5% 92%95% Correctly recognized Unrecognized
  • 31. Fuzzy hashes Introduction • Can find similar or close to similar messages • Rspamd uses fuzzy logic for text and exact hashes for attachments • Storage engine - sqlite3
  • 32. Shingles Algorithm Quick brown fox jumps over lazy dog w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 N hashes h1 h2 h3 h4 h1’ h2’ h3’ h4’
  • 33. Shingles Algorithm h1 h2 h3 h1’ h2’ h3’ h1’’ h2’’ h3’’ h1’’’’’ h2’’’’ h3’’’’ … … … … … min min min min … N shinglesN hash pipes
  • 34. Shingles Algorithm • Probabilistic in general • Uses siphash with different keys • Window size - 3 words, number of hashes - 32
  • 36. Main Principles • Always consider performance. • If something could be skipped, skip it • Use more specific solution if it is faster than generic one (special FSM, asm fragments) • If it is possible to do something fast but not 100% accurate, try faster method
  • 37. Rules Optimisation Global optimisation • Stop checks if a message has reached threshold • Prioritise rules with negative weight • Prioritise faster and more frequent rules: • Check weight, hit probability and average time • Use greedy sorting algorithm
  • 38. Rules Optimisation Local optimisations • Build Abstract Syntax Tree for each rule • Tree is periodically resorted in respect of hit probability and execution time • Regular expressions are optimised using PCRE +JIT and Hyperscan (from 1.1) • Lua code is optimised by means of LuaJIT
  • 39. AST Optimisations Branches cut & |C ! B A A = 0, B = 1, C = 0 0 1 10 1 0 Order of processing 4/6 branches skipped (!A or B) and C
  • 40. AST optimisations N-ary operations > + 2 ! B A Check order C D E What we compare? Here is our limit Stop processing (!A+B+C+D+E) > 2
  • 41. C Library optimisations • Accelerated base32/base64 • Optimised string functions • Fixed size strings • Own printf implementation • Own logger layer • Assembly versions for crypto ciphers and hashes
  • 43. Hyperscan in Rspamd Problem statement • Need to match many regular expressions for the same data: len: 610591, time: 2492.457ms real, 882.251ms virtual regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre • No need to capture patterns • Need to match both UTF8 and raw data • Need to support Perl compatible syntax
  • 44. Naive solution • Join all expressions with ‘|’: /abc/ + /cde/ => /(?:abc)|(?:cde)/ • Won’t work: /abc/ + /a/ => /(?:abc)|(?:a)/ -> the second rule won’t match for ‘abc’
  • 46. Rules and Hyperscan From Subject Text Part HTML Part Attachment Header RE class Header RE class Text Parts class Raw Body class Whole message m1 m2 m3 m4 m5 m6 MIME message Hyperscan regexps
  • 47. Problems with Hyperscan • Slow compile time (like 100 times slower than PCRE+JIT) • Many PCRE features are unsupported (lookahead, look-behind) • Is not packaged for the major OSes (requires relatively fresh boost)
  • 48. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Use PCRE only Compile to disk
  • 49. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Use PCRE only All done Broadcast message
  • 50. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Switch to hyperscan Loaded from cache Wait for ‘recompile’ command
  • 51. Lazy hyperscan compile • Does not affect the starting time • If nothing changes then scanners load pre- compiled expressions on startup (using crypto hashes) • Recompile command is used to force recompilation process: rspamadm recompile
  • 52. Unsupported features Pre-filter mode Text /re1/pre /re2/ /re3/pre /re4/ /re5/ /re6/ /re7/ Scan Maybe match? Match Match Check PCRE Match! Replaces unsupported constructs with "weaker" supported ones, guaranteeing a match superset:
  • 53. Hyperscan results Use Hyperscan for the vast majority of RE rules: len: 610591, time: 2492.457ms real, 882.251ms virtual regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre len: 610591, time: 654.596ms real, 309.785ms virtual regexp statistics: 34 pcre regexps scanned, 41 regexps matched, 8.41M bytes scanned using pcre, 9.56M bytes scanned total
  • 55. Main Principles • Use secure C programming guidelines • Perform static analysis for the code including special clang plugin for own printf format safety • Deal with malicious messages • Limit and protect DNS subsystem • Encrypt and protect all data passed over the network
  • 56. Examples • An email can contain thousands of objects to check (e.g. URLs) • SPF records could have infinite recursion • Need limits for network checks and memoization of the results
  • 57. DNS protection • Cryptographic DNS id generation • Socket pool with constant rotation • Source port randomisation • Input filtering (+ IDN encoding)
  • 58. Traffic Encryption • Security -> performance -> simplicity • TLS is hard in non-blocking IO • TLS does a lot of unnecessary (in case of rspamd) operations increasing thus CPU load and latency
  • 66. Why HTTPCrypt is fast • Faster key exchange (ECDH) • Server does not need to sign anything
 (needed for replay protection in TLS) • Encryption is done in-place • No replay protection for the first message - still suitable for idempotent scanning
  • 68. Conclusions • Rspamd is designed to work as fast as possible • Rspamd has a rich and actively maintained collection of rules and plugins and can re-use existing SA rules • It is possible to encrypt all data transferred over the network in rspamd ecosystem
  • 69. Open issues • No Debian maintainer (any volunteers?) • No dynamic rules updates (planned for 1.2) yet • Rapid development (hard to keep compatibility)
  • 72. Rspamd processes ✉ Scanning processes Scanning processes Scanner processes Controller Service processes Main process ✉✉ HTTP 📬 ✉✉✉ Learn Messages Results
  • 73. Configuration evolution 1. Grammar parser (lex + yacc) ⛔ Hard to manage ⛔ Hard to extend 2. XML ⛔ Unreadable ⛔ Problems with expressions (A &gt; B) 3. UCL - universal configuration language ✅ Easy to manage (looks like nginx.conf) ✅ Macro support ✅ JSON data model (can be used as JSON parser)
  • 74. UCL building blocks • Sections • Arrays • Variables • Macros • Comments section { key = “value”; number = 10K; } upstreams = [ “localhost:80”, “example.com:8080”, ] static_dir = “${WWWDIR}/“; filepath = “${CURDIR}/data”; .include “${CONFDIR}/workers.conf” .include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf” .lua { print(“hey!”); } key = value; // Single line comment /* Multiline comment /* can also be nested */ */
  • 75. Configuration components Global options Workers configuration Scores (metrics) Modules configuration Statistics • Each component is normally included to the main configuration • rspamd.local.conf is used to extend configuration • rspamd.override.conf is used to override values in the configuration • It is possible to use numeric multipliers: “k/m/g” or “ms/s/m/h/d” for time values
  • 76. Lua rules • Most of rules are defined in Lua configuration • Two types of Lua rules: • Regexp rules (look like strings) • Lua functions (pure Lua code)
  • 77. Lua rules Some examples -- Outlook versions that should be excluded from summary rule local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H' local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H' -- Summary rule for forged outlook reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s', forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid) • Regexp rule • Lua rule rspamd_config.R_EMPTY_IMAGE = function(task) local tp = task:get_text_parts() -- get text parts in a message for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs` if p:is_html() then -- if the current part is html part local hc = p:get_html() -- we get HTML context local len = p:get_length() -- and part's length if len < 50 then -- if we have a part that has less than 50 bytes of text local images = hc:get_images() -- then we check for HTML images if images then -- if there are images for _,i in ipairs(images) do -- then iterate over images in the part if i['height'] + i['width'] >= 400 then -- if we have a large image return true -- add symbol end end end end end end end
  • 78. Pure LUA functions Review • Are very powerful • Have access to all information from rspamd via lua API: https://rspamd.com/doc/lua/ • Are very fast since C <-> LUA interaction is cheap • Can use zero-copy objects called rspamd{text} to avoid copying when moving data between C and LUA
  • 79. Pure LUA functions • Variables: • Conditionals: • Loops: • Tables: • Functions: • Closures: local ret = false -- Generic variable local rules = {} -- Empty table local rspamd_logger = require “rspamd_logger" -- Load rspamd module if not ret then -- can use ‘not’, ‘and’, ‘or’ here … elseif ret ~= 10 then -- note ~= for ‘not equal’ operator end for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value for i=1,10 do … end -- Count from 1 to 10 local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1 another_key = function(task) … end, -- Functions can be values [2] = {} -- Other tables can be values } -- Can have both numbers and strings as key and anything as values local function something(task) -- Normal definition local cb = function(data) -- Functions can be nested … end end local function gen_closure(option) local ret = false -- Local variable return function(task) task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here end end rspamd_config.SYMBOL = gen_closure(‘some_option’)
  • 80. Pure LUA functions Generic recommendations • Use local whenever possible (otherwise, global variables are expensive) • Callbacks, closures and recursion are generally cheap (when using LuaJIT) • Do not mix string and number keys in tables, that makes them hard to iterate • ipairs and pairs are not equal • Strings are constant in LUA
  • 81. Regexp rules Types • Can work with the following elements: • Headers: Message-Id=/^something$/H • Mime parts: /some word/P • Raw messages: /some pattern/M • URLs: /example.com/U • Some new flags are added: • UTF8 flag: /u
  • 82. Regexp rules Generic information • Can be combined using the following operators: • AND: /something/P && Subject=/some/H • OR: /something/P || Subject=/some/H • NOT: !/something/P • PLUS: /A/P + /B/P + /C/P >= 2 • Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS • Braces can change priority: !A AND (B OR C)
  • 83. Regexp rules Performance considerations • Avoid message regexps at any cost (use trie instead) • Regexp expressions are highly optimised in rspamd and unnecessary evaluations are not performed • UTF regexps are more expensive than default ones (but could be useful sometimes) • Always use the appropriate type of expression (e.g. url for links and part for textual content)
  • 84. Trie matching • Perfect for fast raw message and text pattern matching • Scales almost linearly with input size (aho-corasic algorithm) • Can handle thousands and hundreds of thousands of patterns (is a base for all antivirus scanners) • Highly optimised for 64 bit systems