Lessons we learned while building a real-time network traffic analyzer in C/C++
Alex Moskvin
CEO/CTO @ Plexteq
About myself
• CEO/CTO Plexteq OÜ
• Ph.D. in information technology
• Interests
• Software architecture
• High-load systems
• Everything under the hood
• AI/ML + BigData
• Knowledge sharing ;)
• Follow me
• https://twitter.com/amoskvin
• https://www.facebook.com/moskvin.aleksey
Plexteq
• High-load backends
• Complex distributed data processing
pipelines
• Big Data / BI
• We build our own products (hardware + software solutions)
We are hiring! ;)
Agenda
1. What the task was about
2. How we decided to solve it
3. Challenges we faced
4. Lessons we learned
Disclaimer ;)
This talk is based on personal experience.
Use at your own risk.
Task definition
• A network services provider needs to:
• Analyze past threats/interactions
• Indicate network traffic spikes in real time
• Aggregate metadata from hundreds of systems
• The solution should be:
• Fast and resource-efficient (no CPU/RAM hogging)
• Potentially cross-platform
• Easy to integrate with ETL and BI systems
• Regular bandwidth: 100-1000 Mbps
Data model
2 dimensions:
• Per port: time period, source IP, destination port, protocol type, in/out bytes, in/out packets
• Per protocol type (TCP/UDP/…): time period, protocol type, in/out bytes, in/out packets
High-level architecture
Existing solutions
• tcpdump
• wireshark
• iptables
Existing solutions
$ for i in 1 2 3; do
some tcpdump exercise
done
Existing solutions
$ tcpdump -i eth0
$ tcpdump tcp port 443
$ tcpdump tcp 'port 443 or port 80'
$ tcpdump -w out-file tcp 'port 443 or port 80'
Existing solutions
• Drawbacks
• tcpdump / wireshark
• Single-threaded
• Large disk space overhead (by default the full packet contents are written)
• No custom output format (extra effort is needed to parse the .pcap files)
• iptables
• Could work, but would be hard to customize for further feature requests
• Not cross-platform
Existing solutions
We want our own bicycle ;)
Main functions
Okay, so we want to capture traffic from the kernel.
How should we do it?
Traffic capturing
• Raw sockets
• pf_ring
• 3rd party libraries
• libtins
• pcapplusplus
• libpcap
Traffic capturing :: Raw sockets
Drawbacks:
• Kernel-to-userspace copies
• Developer needs to be proficient with packet structure and low-level networking semantics, e.g. endianness
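The original code slide was lost in export. Below is a hedged, minimal sketch of the raw-socket approach on Linux (AF_PACKET is Linux-specific; interface binding and error handling are trimmed):

// Minimal raw-socket capture sketch (Linux, needs root/CAP_NET_RAW).
// Illustrative only; not the original implementation.
#include <arpa/inet.h>       // htons
#include <linux/if_ether.h>  // ETH_P_ALL
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // ETH_P_ALL: receive frames of every protocol
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[65536];
    for (;;) {
        // Every recvfrom() copies one frame from kernel to user space
        ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
        if (n < 0) break;
        // Parsing Ethernet/IP headers from frame[] is up to us,
        // including byte-order (endianness) handling
        printf("captured %zd bytes\n", n);
    }
    close(fd);
    return 0;
}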
Traffic capturing :: pf_ring
PF_RING – kernel bypass
Motivation:
• The kernel network stack is very slow
• A vanilla kernel can handle 1-2 Mpps
• PF_RING can do 15+ Mpps on commodity hardware (packets are delivered straight to the PF_RING client, bypassing the kernel network stack)
Pros
• Handles huge workloads
• Could be used for network server application development
• Zero-copy technique
Cons
• Complicated API
• Works best with support at the network card driver level
• PF_RING ZC API is complex
• Not cross-platform
Traffic capturing :: 3rd party libs
Pros:
• Cross-platform
• May utilize low-level, OS-dependent optimizations and extensions, e.g. PF_RING
Traffic capturing :: winner
libpcap
• Cross-platform
• Supports PF_RING
• The fastest implementation
• Well maintained
• Relatively easy API
Traffic capturing :: libpcap
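The libpcap code slides were images; here is a hedged reconstruction of a typical capture loop (the device name eth0 and the filter string are placeholders, not the original values):

// Minimal libpcap capture loop; compile with -lpcap
#include <pcap/pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes) {
    (void)user; (void)bytes;
    printf("packet: %u bytes\n", h->caplen);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    // Promiscuous mode, 65535-byte snaplen, 1000 ms read timeout
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (!handle) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

    // Optional BPF filter; evaluated in the kernel where supported
    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "tcp or udp", 1,
                     PCAP_NETMASK_UNKNOWN) == 0)
        pcap_setfilter(handle, &prog);

    // Hand every captured packet to on_packet(); -1 = loop forever
    pcap_loop(handle, -1, on_packet, NULL);
    pcap_close(handle);
    return 0;
}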
Solutions to store data
We wanted something that:
• Has a small footprint and is fast
• Is preferably a single-file database
• Is embeddable
• Supports SQL
• Supports B-tree indices
Solutions to store data :: winner
SQLite checks all the boxes: small, fast, embeddable, single-file, SQL, B-tree indices.
Drawbacks:
• Single-threaded – we need to synchronize/serialize write ops to it in our application
SQLite :: code examples
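The original code slides were images. Below is a hedged sketch of how the SQLite C API fits here; the table and column names mirror the per-port data model above but are our illustration, not necessarily the original schema:

// Minimal SQLite usage sketch; compile with -lsqlite3
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    // sqlite3_open() creates the database file if it doesn't exist
    if (sqlite3_open("stats.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS port_stats ("
        " period INTEGER, src_ip TEXT, dst_port INTEGER, proto INTEGER,"
        " in_bytes INTEGER, out_bytes INTEGER,"
        " in_packets INTEGER, out_packets INTEGER)",
        NULL, NULL, NULL);

    // Prepared statement with bound parameters for the hot insert path
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "INSERT INTO port_stats VALUES (?,?,?,?,?,?,?,?)", -1, &stmt, NULL);
    sqlite3_bind_int64(stmt, 1, 1600000000);                  // time period
    sqlite3_bind_text(stmt, 2, "10.0.0.1", -1, SQLITE_STATIC);
    sqlite3_bind_int(stmt, 3, 443);
    sqlite3_bind_int(stmt, 4, 6);                             // IPPROTO_TCP
    sqlite3_bind_int64(stmt, 5, 1500);                        // in bytes
    sqlite3_bind_int64(stmt, 6, 300);                         // out bytes
    sqlite3_bind_int64(stmt, 7, 10);                          // in packets
    sqlite3_bind_int64(stmt, 8, 5);                           // out packets
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}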
We have the core toolchain now!
Let's glue it together
Packet processing flow
Producer-consumer problem
• Issues:
• The aggregator can't keep up with traffic above 25 Mbps
• The delay between incoming traffic and flushed stats keeps growing
This is actually a producer-consumer type of problem
Producer-consumer problem
We need to handle packets in multiple threads (libpcap itself is single-threaded)
Producer-consumer problem
• Solution:
• Producer runs in a separate thread
• Multiple consumers run in separate threads
Possible implementations:
• Message broker
• Blocking queue
Producer-consumer problem
We need a blocking queue for this purpose
Producer-consumer problem
A very good implementation: APR (Apache Portable Runtime),
used by the Apache web server
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html
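A hedged sketch of how apr_queue_t can glue the capture thread to the worker threads (QUEUE_CAPACITY and the function names are illustrative, not from the original code):

// Bounded blocking queue from apr-util; link with -lapr-1 -laprutil-1
#include <apr_general.h>
#include <apr_pools.h>
#include <apr_queue.h>

#define QUEUE_CAPACITY 4096

static apr_queue_t *queue;

// Producer side: called from the capture thread for every packet
static void enqueue_packet(void *pkt) {
    apr_queue_push(queue, pkt);          // blocks while the queue is full
}

// Consumer side: each worker thread runs this loop
static void *consumer_loop(void *arg) {
    (void)arg;
    void *pkt;
    while (apr_queue_pop(queue, &pkt) == APR_SUCCESS) {  // blocks while empty
        /* aggregate per-port / per-protocol stats for pkt */
    }
    return NULL;
}

int main(void) {
    apr_initialize();
    apr_pool_t *pool;
    apr_pool_create(&pool, NULL);
    apr_queue_create(&queue, QUEUE_CAPACITY, pool);
    /* spawn the capture thread and N consumer threads here */
    (void)enqueue_packet; (void)consumer_loop;
    apr_terminate();
    return 0;
}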
Packet processing flow
Memory allocation
• Issues:
• Application can only handle about 82 Mbps of traffic
• CPU usage is 100+%, mostly eaten by malloc calls
• The business logic needed at least one malloc per packet when stats were aggregated into the in-memory data structure
Malloc issue
Solution:
• Use memory pooling
• Pre-allocate a large block with malloc
• Serve individual allocations from within the block (an allocation within the block eventually boils down to pointer arithmetic)
Drawbacks:
• Can't free an individual allocation within a block, only the whole block at once
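A minimal sketch of the pooling idea; pool_t and its functions are our illustration, not the implementation that shipped:

// One big malloc up front; each "allocation" is pointer arithmetic
#include <stdlib.h>

typedef struct {
    char  *block;  // the single pre-allocated region
    size_t size;   // total capacity
    size_t used;   // bump-pointer offset
} pool_t;

static int pool_init(pool_t *p, size_t size) {
    p->block = malloc(size);             // the only malloc call
    p->size = size;
    p->used = 0;
    return p->block != NULL;
}

static void *pool_alloc(pool_t *p, size_t n) {
    n = (n + 7) & ~(size_t)7;            // keep 8-byte alignment
    if (p->used + n > p->size) return NULL;
    void *ptr = p->block + p->used;
    p->used += n;                        // "allocating" = bumping the offset
    return ptr;
}

// The drawback from the slide: no free() for a single allocation;
// we can only reset or release the whole block at once
static void pool_reset(pool_t *p) { p->used = 0; }

int main(void) {
    pool_t p;
    if (!pool_init(&p, 1 << 20)) return 1;       // one 1 MiB block
    int *counters = pool_alloc(&p, 64 * sizeof(int));
    (void)counters;
    pool_reset(&p);
    free(p.block);
    return 0;
}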
Malloc issue
Some implementations:
• APR (https://apr.apache.org/docs/apr/1.6/group__apr__pools.html)
• Mpool (https://github.com/silentbicycle/mpool)
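For instance, a hedged usage sketch of APR pools, the first implementation listed above:

// APR memory pools; link with -lapr-1
#include <apr_general.h>
#include <apr_pools.h>

int main(void) {
    apr_initialize();

    apr_pool_t *pool;
    apr_pool_create(&pool, NULL);       // grabs a block up front

    // Served from the pool: no malloc call per allocation
    char *buf = apr_palloc(pool, 1500);
    (void)buf;

    // No per-allocation free; the whole pool is released at once
    apr_pool_destroy(pool);
    apr_terminate();
    return 0;
}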
Packet processing flow
Mutexes
• Results:
• Linux:
• Application can handle ~1 Gbps of traffic
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application can handle ~615 Mbps of traffic
• CPU usage is 35% on a 4-core Xeon 2.8 GHz
• Possible reasons:
• The profiler shows a high number of thread synchronization calls from our app (pthread_mutex_lock, pthread_mutex_unlock)
Mutexes
• Investigation:
• pthread_mutex_* on Linux is implemented using futexes (fast userspace mutexes): no locking syscall and no context switch in the uncontended case
• POSIX is a standard; it doesn't mandate a specific implementation
• OSX/FreeBSD use a heavier approach
Mutexes
• Thread synchronization approaches:
• Lock based
• Semaphore
• Mutex
• Lock free
• Futex (falls back to blocking under contention)
• Spin lock
• CAS based spin lock
Mutexes
• Our target critical section:
• No I/O operations
• Just pointer operations, arithmetic and allocations on the memory pool
• Options
• Spin lock from OS
• pthread_spin_lock
• Custom spin lock based on CAS operations
• GCC atomic built-ins
• __sync_lock_test_and_set
• __sync_lock_release
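A hedged sketch of the first option, the spin lock POSIX already provides (note: pthread_spin_* is not available on macOS, which may be one reason to roll a custom CAS lock):

// OS-provided spin lock; compile with -pthread
#include <pthread.h>

static pthread_spinlock_t slock;

static void stats_update(void) {
    pthread_spin_lock(&slock);     // busy-waits instead of sleeping
    /* short critical section: pointer arithmetic, pool allocation */
    pthread_spin_unlock(&slock);
}

int main(void) {
    // PTHREAD_PROCESS_PRIVATE: shared only between threads of this process
    pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
    stats_update();
    pthread_spin_destroy(&slock);
    return 0;
}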
Mutexes
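The code slide itself was lost in export; below is a hedged reconstruction of the CAS-based spin lock matching the explanation that follows:

// Custom spin lock built on GCC atomic built-ins
static volatile int lock = 0;   // volatile: may be changed by other threads

static void spin_lock(void) {
    // Atomically write 1 and get the previous value; 0 means we got the
    // lock. While another thread holds it, we keep spinning.
    while (__sync_lock_test_and_set(&lock, 1)) {
        /* busy-wait */
    }
}

static void spin_unlock(void) {
    // Write 0 with release semantics, letting a spinning thread proceed
    __sync_lock_release(&lock);
}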
1) volatile suggests that "lock" may be changed by other threads
2) __sync_lock_test_and_set and __sync_lock_release are atomic built-ins which guarantee atomic memory access
3) __sync_lock_test_and_set atomically writes 1 and returns the previous value, so a return value of 0 means we acquired the lock
4) While lock == 1, we keep looping until another thread calls __sync_lock_release
Mutexes
• Results:
• Linux:
• Application can handle ~1 Gbps of traffic
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application can handle ~1 Gbps of traffic
• CPU usage is 8-12% on a 4-core Xeon 2.8 GHz
Questions?
