Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

rspamd-fosdem

1,857 views

Published on

  • Be the first to comment

rspamd-fosdem

  1. 1. Part I: Introduction
  2. 2. Why Rspamd? SpamAssassin → rspamd in “Rambler” Mail
  3. 3. Rspamd in action Real-time stats
  4. 4. Rspamd in action Web UI
  5. 5. Rspamd in a nutshell • Written in plain C • Uses non-blocking (event-driven) data processing model • Oriented on performance • Plugins and rules are written in Lua • Supports all major filtering methods • Packaged for the most of *nix systems
  6. 6. Spam filtering
  7. 7. Major spam classes • Fraud and scam: • “Nigerian” mail • Phishing (banks, social networks) • Advertisement: • Plain text • Pictures, obfuscated HTML etc • Spam from the social networks notifications
  8. 8. Scam:a typical example
  9. 9. Advertisement: images
  10. 10. Advertisement: images
  11. 11. Filtering methods Basic principles • Policies (particularly efficient against fraud) • DNS lists (for IP and URLs) • SPF, DKIM, DMARC • Sender’s reputation • Static content filtering (patterns match) • Statistics and ML (e.g. Bayesian)
  12. 12. Filtering methods Difficulties • Policies: • Require network requests with unpredictable delay • False positive (e.g. when forwarding) • Static Content filtering: • CPU intensive • Unlikely to be useful long-term • Statistics: • Not very precise • Hard to adopt for multi-user systems
  13. 13. Rspamd special features • Text lemmatisation and normalisation (snowball stemmer) • Rspamd converts all texts to utf-8 • Support for users settings in JSON • Possibility to have multiple metrics for a message • Web interface • Rspamd can load SpamAssassin rules directly • OpenSource (BSD 2-clause)
  14. 14. Rspamd integration Using Rmilter MTA Rmilter Rspamd Storage Message and SMTP DATA HTTP request Redis cache JSON: action, symbolsDelay, modify or reject Greylisting, limits Statistics, shared cache ✉✉✉✉ ✉ ✉ 🚮 Inbox Spam ✉ Reject ✉
  15. 15. Rspamd integration milter • rmilter + postfix/sendmail: • supports all rspamd features (e.g. greylisting action) • can split inbound and outbound messages • can whitelist reply to own messages • can add DKIM signatures to outgoing email
  16. 16. Rspamd integration Other MTA • exim - supports rspamd internally (from 4.86) • haraka - supports rspamd internally • any other MTA - LDA mode when rspamc checks message on rspamd and calls delivery agent with altered message using Unix pipes
  17. 17. Part II: Architecture
  18. 18. Development principles • Performance and scalability • Filtering quality and extensions • Security • Simplicity and documentation coverage
  19. 19. Rules in Rspamd Pre-filters Filters Post-filters Rule Rule Rule Rule Rule Rule Rule Rule Rule Rule Policy rules Content rules Statistics rules Settings, ratelimit Main scan Postprocessing
  20. 20. Message lifecycle RulesRules Filters RulesRules Pre-filters RulesRules Post-filters 📩 Message 📬 Result Wait Wait Wait for dependencies
  21. 21. Event driven model Never blocks* • Advantages: ✅ Can process network replies asynchronously ✅ Sends all network request simultaneously ✅ Can process many messages in parallel, scales well • Disadvantages: 📛 Callbacks hell ⛔ Memory consumption can go wild *almost never
  22. 22. Sequential processing Traditional approach Rule 1 Rule 2 Rule 3 DNS Hashes Wait Wait Processing
  23. 23. Event-driven processing Rule 1 Rule 2 Rule 3 DNS Hashes Wait Processing
  24. 24. Rule 1 Rule 1 Rules DNS Hashes Wait Event-driven processing What really happens?
  25. 25. Event-driven processing • Rspamd can send thousands of DNS requests: 
 time: 5540.8ms real, 2427.4ms virtual, dns req: 120543
  26. 26. Event-driven processing • Rspamd can send thousands of DNS requests: 
 time: 5540.8ms real, 2427.4ms virtual, dns req: 120543 • For short messages waiting prevails on the processing time: 
 time: 996.140ms real, 22.000ms virtual,
  27. 27. Statistics in rspamd • Uses Markov chains for statistic tokens (using sparse bi-gramms model) • Considers metadata in statistics: mime structure, headers and so on • Statistics could be stored in redis or sqlite3 • Rspamd supports per-user, per-domain statistics, multiple classifiers and autolearning
  28. 28. Tokens generation Quick brown fox jumps over lazy dog 1 2 3 4 1 2
  29. 29. Statistics example Hard case: images spam 0% 25% 50% 75% 100% Spam Ham 8%5% 92%95% Correctly recognized Unrecognized
  30. 30. Fuzzy hashes Introduction • Can find similar or close to similar messages • Rspamd uses fuzzy logic for text and exact hashes for attachments • Storage engine - sqlite3
  31. 31. Shingles Algorithm Quick brown fox jumps over lazy dog w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 w1 w2 w3 w2 w3 w4 w3 w4 w5 w4 w5 w6 N hashes h1 h2 h3 h4 h1’ h2’ h3’ h4’
  32. 32. Shingles Algorithm h1 h2 h3 h1’ h2’ h3’ h1’’ h2’’ h3’’ h1’’’’’ h2’’’’ h3’’’’ … … … … … min min min min … N shinglesN hash pipes
  33. 33. Shingles Algorithm • Probabilistic in general • Uses siphash with different keys • Window size - 3 words, number of hashes - 32
  34. 34. Part III: Performance
  35. 35. Main Principles • Always consider performance. • If something could be skipped, skip it • Use more specific solution if it is faster than generic one (special FSM, asm fragments) • If it is possible to do something fast but not 100% accurate, try faster method
  36. 36. Rules Optimisation Global optimisation • Stop checks if a message has reached threshold • Prioritise rules with negative weight • Prioritise faster and more frequent rules: • Check weight, hit probability and average time • Use greedy sorting algorithm
  37. 37. Rules Optimisation Local optimisations • Build Abstract Syntax Tree for each rule • Tree is periodically resorted in respect of hit probability and execution time • Regular expressions are optimised using PCRE +JIT and Hyperscan (from 1.1) • Lua code is optimised by means of LuaJIT
  38. 38. AST Optimisations Branches cut & |C ! B A A = 0, B = 1, C = 0 0 1 10 1 0 Order of processing 4/6 branches skipped (!A or B) and C
  39. 39. AST optimisations N-ary operations > + 2 ! B A Check order C D E What we compare? Here is our limit Stop processing (!A+B+C+D+E) > 2
  40. 40. C Library optimisations • Accelerated base32/base64 • Optimised string functions • Fixed size strings • Own printf implementation • Own logger layer • Assembly versions for crypto ciphers and hashes
  41. 41. Hyperscan in Rspamd Multi-regexp match engine
  42. 42. Hyperscan in Rspamd Problem statement • Need to match many regular expressions for the same data: len: 610591, time: 2492.457ms real, 882.251ms virtual regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre • No need to capture patterns • Need to match both UTF8 and raw data • Need to support Perl compatible syntax
  43. 43. Naive solution • Join all expressions with ‘|’: /abc/ + /cde/ => /(?:abc)|(?:cde)/ • Won’t work: /abc/ + /a/ => /(?:abc)|(?:a)/ -> the second rule won’t match for ‘abc’
  44. 44. Hyperscan Text /re1/ /re2/ /re3/ /re4/ /re5/ /re6/ /re7/ Scan Match Match Match
  45. 45. Rules and Hyperscan From Subject Text Part HTML Part Attachment Header RE class Header RE class Text Parts class Raw Body class Whole message m1 m2 m3 m4 m5 m6 MIME message Hyperscan regexps
  46. 46. Problems with Hyperscan • Slow compile time (like 100 times slower than PCRE+JIT) • Many PCRE features are unsupported (lookahead, look-behind) • Is not packaged for the major OSes (requires relatively fresh boost)
  47. 47. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Use PCRE only Compile to disk
  48. 48. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Use PCRE only All done Broadcast message
  49. 49. Lazy hyperscan compile HS compiler Main Scanners Compiled class Compiled class Compiled class Switch to hyperscan Loaded from cache Wait for ‘recompile’ command
  50. 50. Lazy hyperscan compile • Does not affect the starting time • If nothing changes then scanners load pre- compiled expressions on startup (using crypto hashes) • Recompile command is used to force recompilation process: rspamadm recompile
  51. 51. Unsupported features Pre-filter mode Text /re1/pre /re2/ /re3/pre /re4/ /re5/ /re6/ /re7/ Scan Maybe match? Match Match Check PCRE Match! Replaces unsupported constructs with "weaker" supported ones, guaranteeing a match superset:
  52. 52. Hyperscan results Use Hyperscan for the vast majority of RE rules: len: 610591, time: 2492.457ms real, 882.251ms virtual regexp statistics: 4095 pcre regexps scanned, 18 regexps matched, 694M bytes scanned using pcre len: 610591, time: 654.596ms real, 309.785ms virtual regexp statistics: 34 pcre regexps scanned, 41 regexps matched, 8.41M bytes scanned using pcre, 9.56M bytes scanned total
  53. 53. Part IV: Security
  54. 54. Main Principles • Use secure C programming guidelines • Perform static analysis for the code including special clang plugin for own printf format safety • Deal with malicious messages • Limit and protect DNS subsystem • Encrypt and protect all data passed over the network
  55. 55. Examples • An email can contain thousands of objects to check (e.g. URLs) • SPF records could have infinite recursion • Need limits for network checks and memoization of the results
  56. 56. DNS protection • Cryptographic DNS id generation • Socket pool with constant rotation • Source port randomisation • Input filtering (+ IDN encoding)
  57. 57. Traffic Encryption • Security -> performance -> simplicity • TLS is hard in non-blocking IO • TLS does a lot of unnecessary (in case of rspamd) operations increasing thus CPU load and latency
  58. 58. HTTPCrypt Protocol
  59. 59. Connection Establishment ClientHelo ServerHelo Certificates⚠ ClientChangeCipher ServerChangeCipher Data TLS handshake ClientPK Encrypted request Encrypted reply Validation (optional) HTTPCrypt handshake
  60. 60. Throughput Cached connections
  61. 61. Throughput New connections
  62. 62. Throughput New connections
  63. 63. Connection Latency Cached connections
  64. 64. Connection Latency New connections
  65. 65. Why HTTPCrypt is fast • Faster key exchange (ECDH) • Server does not need to sign anything
 (needed for replay protection in TLS) • Encryption is done in-place • No replay protection for the first message - still suitable for idempotent scanning
  66. 66. Conclusions
  67. 67. Conclusions • Rspamd is designed to work as fast as possible • Rspamd has a rich and actively maintained collection of rules and plugins and can re-use existing SA rules • It is possible to encrypt all data transferred over the network in rspamd ecosystem
  68. 68. Open issues • No Debian maintainer (any volunteers?) • No dynamic rules updates (planned for 1.2) yet • Rapid development (hard to keep compatibility)
  69. 69. Questions? Vsevolod Stakhov https://rspamd.com
  70. 70. Appendix
  71. 71. Rspamd processes ✉ Scanning processes Scanning processes Scanner processes Controller Service processes Main process ✉✉ HTTP 📬 ✉✉✉ Learn Messages Results
  72. 72. Configuration evolution 1. Grammar parser (lex + yacc) ⛔ Hard to manage ⛔ Hard to extend 2. XML ⛔ Unreadable ⛔ Problems with expressions (A > B) 3. UCL - universal configuration language ✅ Easy to manage (looks like nginx.conf) ✅ Macro support ✅ JSON data model (can be used as JSON parser)
  73. 73. UCL building blocks • Sections • Arrays • Variables • Macros • Comments section { key = “value”; number = 10K; } upstreams = [ “localhost:80”, “example.com:8080”, ] static_dir = “${WWWDIR}/“; filepath = “${CURDIR}/data”; .include “${CONFDIR}/workers.conf” .include (glob=true,priority=2) “${CONFDIR}/conf.d/*.conf” .lua { print(“hey!”); } key = value; // Single line comment /* Multiline comment /* can also be nested */ */
  74. 74. Configuration components Global options Workers configuration Scores (metrics) Modules configuration Statistics • Each component is normally included to the main configuration • rspamd.local.conf is used to extend configuration • rspamd.override.conf is used to override values in the configuration • It is possible to use numeric multipliers: “k/m/g” or “ms/s/m/h/d” for time values
  75. 75. Lua rules • Most of rules are defined in Lua configuration • Two types of Lua rules: • Regexp rules (look like strings) • Lua functions (pure Lua code)
  76. 76. Lua rules Some examples -- Outlook versions that should be excluded from summary rule local fmo_excl_o3416 = 'X-Mailer=/^Microsoft Outlook, Build 10.0.3416$/H' local fmo_excl_oe3790 = 'X-Mailer=/^Microsoft Outlook Express 6.00.3790.3959$/H' -- Summary rule for forged outlook reconf['FORGED_MUA_OUTLOOK'] = string.format('(%s | %s) & !%s & !%s & !%s', forged_oe, forged_outlook_dollars, fmo_excl_o3416, fmo_excl_oe3790, vista_msgid) • Regexp rule • Lua rule rspamd_config.R_EMPTY_IMAGE = function(task) local tp = task:get_text_parts() -- get text parts in a message for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs` if p:is_html() then -- if the current part is html part local hc = p:get_html() -- we get HTML context local len = p:get_length() -- and part's length if len < 50 then -- if we have a part that has less than 50 bytes of text local images = hc:get_images() -- then we check for HTML images if images then -- if there are images for _,i in ipairs(images) do -- then iterate over images in the part if i['height'] + i['width'] >= 400 then -- if we have a large image return true -- add symbol end end end end end end end
  77. 77. Pure LUA functions Review • Are very powerful • Have access to all information from rspamd via lua API: https://rspamd.com/doc/lua/ • Are very fast since C <-> LUA interaction is cheap • Can use zero-copy objects called rspamd{text} to avoid copying when moving data between C and LUA
  78. 78. Pure LUA functions • Variables: • Conditionals: • Loops: • Tables: • Functions: • Closures: local ret = false -- Generic variable local rules = {} -- Empty table local rspamd_logger = require “rspamd_logger" -- Load rspamd module if not ret then -- can use ‘not’, ‘and’, ‘or’ here … elseif ret ~= 10 then -- note ~= for ‘not equal’ operator end for k,m in pairs(opts) do … end -- Iterate over keyed table a[‘key’] = value for _,i in ipairs(images) do … end -- Iterate over array table a[1] = value for i=1,10 do … end -- Count from 1 to 10 local options = { [1] = ‘value’, [‘key’] = 1, -- Numbers starts from 1 another_key = function(task) … end, -- Functions can be values [2] = {} -- Other tables can be values } -- Can have both numbers and strings as key and anything as values local function something(task) -- Normal definition local cb = function(data) -- Functions can be nested … end end local function gen_closure(option) local ret = false -- Local variable return function(task) task:do_something(ret, option) -- Both ‘ret’ and ‘option’ are accessible here end end rspamd_config.SYMBOL = gen_closure(‘some_option’)
  79. 79. Pure LUA functions Generic recommendations • Use local whenever possible (otherwise, global variables are expensive) • Callbacks, closures and recursion are generally cheap (when using LuaJIT) • Do not mix string and number keys in tables, that makes them hard to iterate • ipairs and pairs are not equal • Strings are constant in LUA
  80. 80. Regexp rules Types • Can work with the following elements: • Headers: Message-Id=/^something$/H • Mime parts: /some word/P • Raw messages: /some pattern/M • URLs: /example.com/U • Some new flags are added: • UTF8 flag: /u
  81. 81. Regexp rules Generic information • Can be combined using the following operators: • AND: /something/P && Subject=/some/H • OR: /something/P || Subject=/some/H • NOT: !/something/P • PLUS: /A/P + /B/P + /C/P >= 2 • Priority goes as following: NOT ➡ AND ➡ OR ➡ PLUS • Braces can change priority: !A AND (B OR C)
  82. 82. Regexp rules Performance considerations • Avoid message regexps at any cost (use trie instead) • Regexp expressions are highly optimised in rspamd and unnecessary evaluations are not performed • UTF regexps are more expensive than default ones (but could be useful sometimes) • Always use the appropriate type of expression (e.g. url for links and part for textual content)
  83. 83. Trie matching • Perfect for fast raw message and text pattern matching • Scales almost linearly with input size (aho-corasic algorithm) • Can handle thousands and hundreds of thousands of patterns (is a base for all antivirus scanners) • Highly optimised for 64 bit systems

×