Lessons we learned while building a real-time network traffic analyzer in C/C++
Alex Moskvin
CEO/CTO @ Plexteq
About myself
• CEO/CTO Plexteq OÜ
• Ph.D. in information technology
• Interests
• Software architecture
• High-load systems
• Everything under the hood
• AI/ML + BigData
• Knowledge sharing ;)
• Follow me
• https://twitter.com/amoskvin
• https://www.facebook.com/moskvin.aleksey
Plexteq
• High-load backends
• Complex distributed data processing
pipelines
• Big Data / BI
• We build our own products (hardware + software solutions)
We are hiring! ;)
Agenda
1. What the task was about
2. How we decided to solve it
3. Challenges we faced
4. Lessons we learned
Disclaimer ;)
This talk is based on personal experience.
Use at your own risk.
Task definition
• A network services provider needs to:
• Analyze past threats/interactions
• Indicate network traffic spikes in real time
• Aggregate metadata from hundreds of systems
• The solution should be:
• Fast and resource-efficient (no CPU/RAM hogging)
• Potentially cross-platform
• Easy to integrate with ETL and BI systems
• Regular bandwidth: 100-1000 Mbps
Data model
2 dimensions:
• Per port: time period, source IP, destination port, protocol type, in/out bytes, in/out packets
• Per protocol type (TCP/UDP/…): time period, protocol type, in/out bytes, in/out packets
High-level architecture
Existing solutions
• tcpdump
• wireshark
• iptables
Existing solutions
$ for i in 1 2 3; do
some tcpdump exercise
done
Existing solutions
$ tcpdump -i eth0
$ tcpdump tcp port 443
$ tcpdump tcp 'port 443 or port 80'
$ tcpdump -w out-file tcp 'port 443 or port 80'
Existing solutions
• Drawbacks
• tcpdump / wireshark
• Single-threaded
• Large disk space overhead (by default the full packet contents are written)
• No custom output format (extra effort is needed to parse the .pcap files)
• iptables
• Could work, but would be hard to customize for further feature requests
• Not cross-platform
Existing solutions
We want our own bicycle ;)
Main functions
Okay, so we want to capture traffic from the kernel.
How should we do it?
Traffic capturing
• Raw sockets
• pf_ring
• 3rd party libraries
• libtins
• pcapplusplus
• libpcap
Traffic capturing :: Raw sockets
Drawbacks:
• Kernel-to-userspace copies
• Developer needs to be proficient with packet structure and low-level networking semantics, e.g. endianness
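The original code slide was lost in export. Below is a hedged, minimal sketch of the raw-socket approach on Linux (AF_PACKET is Linux-specific; interface binding and error handling are trimmed):

// Minimal raw-socket capture sketch (Linux, needs root/CAP_NET_RAW).
// Illustrative only; not the original implementation.
#include <arpa/inet.h>       // htons
#include <linux/if_ether.h>  // ETH_P_ALL
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // ETH_P_ALL: receive frames of every protocol
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[65536];
    for (;;) {
        // Every recvfrom() copies one frame from kernel to user space
        ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
        if (n < 0) break;
        // Parsing Ethernet/IP headers from frame[] is up to us,
        // including byte-order (endianness) handling
        printf("captured %zd bytes\n", n);
    }
    close(fd);
    return 0;
}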
Traffic capturing :: pf_ring
PF_RING – kernel bypass
Motivation:
• The kernel network stack is very slow
• A vanilla kernel can handle 1-2 Mpps
• PF_RING can do 15+ Mpps on commodity hardware (packets are delivered straight to the PF_RING client, bypassing the kernel network stack)
Pros
• Handles huge workloads
• Could be used for network server application development
• Zero-copy technique
Cons
• Complicated API
• Works best with support at the network card driver level
• PF_RING ZC API is complex
• Not cross-platform
Traffic capturing :: 3rd party libs
Pros:
• Cross-platform
• May utilize low-level, OS-dependent optimizations and extensions, e.g. PF_RING
Traffic capturing :: winner
libpcap
• Cross-platform
• Supports PF_RING
• The fastest implementation
• Well maintained
• Relatively easy API
Traffic capturing :: libpcap
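The libpcap code slides were images; here is a hedged reconstruction of a typical capture loop (the device name eth0 and the filter string are placeholders, not the original values):

// Minimal libpcap capture loop; compile with -lpcap
#include <pcap/pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes) {
    (void)user; (void)bytes;
    printf("packet: %u bytes\n", h->caplen);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    // Promiscuous mode, 65535-byte snaplen, 1000 ms read timeout
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (!handle) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

    // Optional BPF filter; evaluated in the kernel where supported
    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "tcp or udp", 1,
                     PCAP_NETMASK_UNKNOWN) == 0)
        pcap_setfilter(handle, &prog);

    // Hand every captured packet to on_packet(); -1 = loop forever
    pcap_loop(handle, -1, on_packet, NULL);
    pcap_close(handle);
    return 0;
}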
Solutions to store data
We wanted something that:
• Has a small footprint and is fast
• Is preferably a single-file database
• Is embeddable
• Supports SQL
• Supports B-tree indices
Solutions to store data :: winner
SQLite checks all the boxes: small, fast, embeddable, single-file, SQL, B-tree indices.
Drawbacks:
• Single-threaded – we need to synchronize/serialize write ops to it in our application
SQLite :: code examples
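The original code slides were images. Below is a hedged sketch of how the SQLite C API fits here; the table and column names mirror the per-port data model above but are our illustration, not necessarily the original schema:

// Minimal SQLite usage sketch; compile with -lsqlite3
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    // sqlite3_open() creates the database file if it doesn't exist
    if (sqlite3_open("stats.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS port_stats ("
        " period INTEGER, src_ip TEXT, dst_port INTEGER, proto INTEGER,"
        " in_bytes INTEGER, out_bytes INTEGER,"
        " in_packets INTEGER, out_packets INTEGER)",
        NULL, NULL, NULL);

    // Prepared statement with bound parameters for the hot insert path
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "INSERT INTO port_stats VALUES (?,?,?,?,?,?,?,?)", -1, &stmt, NULL);
    sqlite3_bind_int64(stmt, 1, 1600000000);                  // time period
    sqlite3_bind_text(stmt, 2, "10.0.0.1", -1, SQLITE_STATIC);
    sqlite3_bind_int(stmt, 3, 443);
    sqlite3_bind_int(stmt, 4, 6);                             // IPPROTO_TCP
    sqlite3_bind_int64(stmt, 5, 1500);                        // in bytes
    sqlite3_bind_int64(stmt, 6, 300);                         // out bytes
    sqlite3_bind_int64(stmt, 7, 10);                          // in packets
    sqlite3_bind_int64(stmt, 8, 5);                           // out packets
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}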
We have the core toolchain now!
Let's glue it together
Packet processing flow
Producer-consumer problem
• Issues:
• The aggregator can't keep up with traffic above 25 Mbps
• The delay between incoming traffic and flushed stats keeps growing
This is actually a producer-consumer type of problem
Producer-consumer problem
We need to handle packets in multiple threads (libpcap itself is single-threaded)
Producer-consumer problem
• Solution:
• Producer runs in a separate thread
• Multiple consumers run in separate threads
Possible implementations:
• Message broker
• Blocking queue
Producer-consumer problem
We need a blocking queue for this purpose
Producer-consumer problem
A very good implementation: APR (Apache Portable Runtime),
used by the Apache web server
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html
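A hedged sketch of how apr_queue_t can glue the capture thread to the worker threads (QUEUE_CAPACITY and the function names are illustrative, not from the original code):

// Bounded blocking queue from apr-util; link with -lapr-1 -laprutil-1
#include <apr_general.h>
#include <apr_pools.h>
#include <apr_queue.h>

#define QUEUE_CAPACITY 4096

static apr_queue_t *queue;

// Producer side: called from the capture thread for every packet
static void enqueue_packet(void *pkt) {
    apr_queue_push(queue, pkt);          // blocks while the queue is full
}

// Consumer side: each worker thread runs this loop
static void *consumer_loop(void *arg) {
    (void)arg;
    void *pkt;
    while (apr_queue_pop(queue, &pkt) == APR_SUCCESS) {  // blocks while empty
        /* aggregate per-port / per-protocol stats for pkt */
    }
    return NULL;
}

int main(void) {
    apr_initialize();
    apr_pool_t *pool;
    apr_pool_create(&pool, NULL);
    apr_queue_create(&queue, QUEUE_CAPACITY, pool);
    /* spawn the capture thread and N consumer threads here */
    (void)enqueue_packet; (void)consumer_loop;
    apr_terminate();
    return 0;
}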
Packet processing flow
Memory allocation
• Issues:
• Application can only handle about 82 Mbps of traffic
• CPU usage is 100+%, mostly eaten by malloc calls
• The business logic needed at least one malloc per packet when stats were aggregated into the in-memory data structure
Malloc issue
Solution:
• Use memory pooling
• Pre-allocate a large block with malloc
• Serve individual allocations from within the block (an allocation within the block eventually boils down to pointer arithmetic)
Drawbacks:
• Can't free an individual allocation within a block, only the whole block at once
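A minimal sketch of the pooling idea; pool_t and its functions are our illustration, not the implementation that shipped:

// One big malloc up front; each "allocation" is pointer arithmetic
#include <stdlib.h>

typedef struct {
    char  *block;  // the single pre-allocated region
    size_t size;   // total capacity
    size_t used;   // bump-pointer offset
} pool_t;

static int pool_init(pool_t *p, size_t size) {
    p->block = malloc(size);             // the only malloc call
    p->size = size;
    p->used = 0;
    return p->block != NULL;
}

static void *pool_alloc(pool_t *p, size_t n) {
    n = (n + 7) & ~(size_t)7;            // keep 8-byte alignment
    if (p->used + n > p->size) return NULL;
    void *ptr = p->block + p->used;
    p->used += n;                        // "allocating" = bumping the offset
    return ptr;
}

// The drawback from the slide: no free() for a single allocation;
// we can only reset or release the whole block at once
static void pool_reset(pool_t *p) { p->used = 0; }

int main(void) {
    pool_t p;
    if (!pool_init(&p, 1 << 20)) return 1;       // one 1 MiB block
    int *counters = pool_alloc(&p, 64 * sizeof(int));
    (void)counters;
    pool_reset(&p);
    free(p.block);
    return 0;
}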
Malloc issue
Some implementations:
• APR (https://apr.apache.org/docs/apr/1.6/group__apr__pools.html)
• Mpool (https://github.com/silentbicycle/mpool)
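For instance, a hedged usage sketch of APR pools, the first implementation listed above:

// APR memory pools; link with -lapr-1
#include <apr_general.h>
#include <apr_pools.h>

int main(void) {
    apr_initialize();

    apr_pool_t *pool;
    apr_pool_create(&pool, NULL);       // grabs a block up front

    // Served from the pool: no malloc call per allocation
    char *buf = apr_palloc(pool, 1500);
    (void)buf;

    // No per-allocation free; the whole pool is released at once
    apr_pool_destroy(pool);
    apr_terminate();
    return 0;
}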
Packet processing flow
Mutexes
• Results:
• Linux:
• Application can handle ~1 Gbps of traffic
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application can handle ~615 Mbps of traffic
• CPU usage is 35% on a 4-core Xeon 2.8 GHz
• Possible reasons:
• The profiler shows a high number of thread synchronization calls from our app (pthread_mutex_lock, pthread_mutex_unlock)
Mutexes
• Investigation:
• pthread_mutex_* on Linux is implemented using futexes (fast userspace mutexes): no locking syscall and no context switch in the uncontended case
• POSIX is a standard; it doesn't mandate a specific implementation
• OSX/FreeBSD use a heavier approach
Mutexes
• Thread synchronization approaches:
• Lock based
• Semaphore
• Mutex
• Lock free
• Futex (falls back to blocking under contention)
• Spin lock
• CAS based spin lock
Mutexes
• Our target critical section:
• No I/O operations
• Just pointer operations, arithmetic and allocations on the memory pool
• Options
• Spin lock from OS
• pthread_spin_lock
• Custom spin lock based on CAS operations
• GCC atomic built-ins
• __sync_lock_test_and_set
• __sync_lock_release
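A hedged sketch of the first option, the spin lock POSIX already provides (note: pthread_spin_* is not available on macOS, which may be one reason to roll a custom CAS lock):

// OS-provided spin lock; compile with -pthread
#include <pthread.h>

static pthread_spinlock_t slock;

static void stats_update(void) {
    pthread_spin_lock(&slock);     // busy-waits instead of sleeping
    /* short critical section: pointer arithmetic, pool allocation */
    pthread_spin_unlock(&slock);
}

int main(void) {
    // PTHREAD_PROCESS_PRIVATE: shared only between threads of this process
    pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
    stats_update();
    pthread_spin_destroy(&slock);
    return 0;
}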
Mutexes
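The code slide itself was lost in export; below is a hedged reconstruction of the CAS-based spin lock matching the explanation that follows:

// Custom spin lock built on GCC atomic built-ins
static volatile int lock = 0;   // volatile: may be changed by other threads

static void spin_lock(void) {
    // Atomically write 1 and get the previous value; 0 means we got the
    // lock. While another thread holds it, we keep spinning.
    while (__sync_lock_test_and_set(&lock, 1)) {
        /* busy-wait */
    }
}

static void spin_unlock(void) {
    // Write 0 with release semantics, letting a spinning thread proceed
    __sync_lock_release(&lock);
}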
1) volatile suggests that "lock" may be changed by other threads
2) __sync_lock_test_and_set and __sync_lock_release are atomic built-ins which guarantee atomic memory access
3) __sync_lock_test_and_set atomically writes 1 and returns the previous value, so a return value of 0 means we acquired the lock
4) While lock == 1, we keep looping until another thread calls __sync_lock_release
Mutexes
• Results:
• Linux:
• Application can handle ~1 Gbps of traffic
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application can handle ~1 Gbps of traffic
• CPU usage is 8-12% on a 4-core Xeon 2.8 GHz
Questions?
