SignalFx
Scaling ingest pipelines with
high performance computing principles
Rajiv Kurian, Software Engineer
rajiv@signalfx.com
Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern
hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
SignalFx
Why we need to scale ingest
SignalFx is an advanced monitoring platform for modern applications
• High resolution:
• Up to 1 sec
• Streaming analytics:
• Charts/analytics update @ 1 sec
• Real time
• Multidimensional metrics:
• Dimensions: representing customer, server, etc.
• Filter, aggregate: e.g. 99th-pct-latency by service, customer
Ingest pipeline (diagram): raw time series data in; REST/rate-control, rollups, and persistence stages; processed data out to analytics.
SignalFx ingest library (diagram): raw data in, rollup data out; one rollup per time series (TimeSeries 0 through TimeSeries 8).
Issues identified (before applying HPC techniques)
• Expensive - too many servers

• Exhibits parallel slowdown
• More threads = worse performance

• What did the profile say?
• Death by a thousand cuts
• The core library = 35% of profile
SignalFx
Basic properties and limitations
of modern hardware
SignalFx
(Diagram: the cache hierarchy. Each core has its own L1 data cache, L1 instruction cache, and L2 cache; the L3 cache is shared between Core 1 and Core 2; main memory sits below.)
Cache Lines
• Data is transferred between memory and cache in blocks of
fixed size, called cache lines. Usually 64 bytes
• When the processor needs to read or write a location in
main memory, it first checks for a corresponding entry in the
cache. In the case of:
• a cache hit, the processor immediately reads or writes
the data in the cache line
• a cache miss, the cache allocates a new entry and
copies in data from main memory, then the request (read
or write) is fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us:
• Temporal locality
• Spatial locality
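To make spatial locality concrete, here is a small Java sketch (not from the slides): summing a contiguous long[] pulls in about eight values per 64-byte cache line, while walking a linked list can miss the cache on every node.

// Sketch: contiguous data exploits spatial locality; pointer chasing does not.
final class LocalityDemo {
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) {      // sequential access: ~8 longs per 64-byte cache line
            sum += v;
        }
        return sum;
    }

    static final class Node {
        long value;
        Node next;                   // each node may live on a different cache line
    }

    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) {  // potential cache miss per node
            sum += n.value;
        }
        return sum;
    }
}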
Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 Cache 0.5 ns
Branch mispredict 5 ns
L2 Cache 7 ns 14x L1 Cache
Mutex lock/unlock 25 ns
Main memory 100 ns 20x L2 Cache, 200x L1 Cache
Compress 1K bytes (Zippy) 3,000 ns
Send 1K bytes over 1Gbps 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms
Read 1MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same DC 500,000 ns 0.5 ms
Read 1MB sequentially from SSD 1,000,000 ns 1 ms 4x memory
Disk seek 10,000,000 ns 10 ms 20x DC roundtrip
Read 1MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
(Diagram slides: the L1 cache sits closest to the core, the L2 cache farther away, and main memory farthest of all.)
Our optimization goal
Convert a memory-bandwidth-bound
application into a CPU-bound application
Things we kept in mind
• Measure, measure, measure!

• Don’t rely on micro benchmarks alone
SignalFx
Benchmark
SignalFx library benchmark (diagram): a table of roughly one million entries mapping IDs (ID 0 through ID 1M) to TimeSeries rollups. Raw data comes in, in random order, one point per time series, and the whole set is replayed 50x; rollup data comes out. This path accounted for 35% of the profile of the entire application.
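The slides do not show the benchmark code; the sketch below is a rough, illustrative harness for the setup described (one million series IDs, points fed in random order, 50 passes), with a plain double[] standing in for per-series rollup state.

import java.util.Random;

// Illustrative harness for "1M time series, random order, 50 passes".
final class IngestBenchmark {
    static final int NUM_SERIES = 1_000_000;
    static final int PASSES = 50;

    public static void main(String[] args) {
        int[] ids = new int[NUM_SERIES];
        for (int i = 0; i < NUM_SERIES; i++) ids[i] = i;

        Random rnd = new Random(42);                 // Fisher-Yates shuffle: random arrival order
        for (int i = NUM_SERIES - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp;
        }

        double[] rollups = new double[NUM_SERIES];   // stand-in for per-series rollup state
        Random values = new Random(7);

        long start = System.nanoTime();
        for (int pass = 0; pass < PASSES; pass++) {
            for (int id : ids) {
                rollups[id] += values.nextDouble();  // one raw point per time series
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(PASSES * (long) NUM_SERIES + " points in " + elapsedMs + " ms");
    }
}

A real measurement would use a harness like JMH to control for JIT warmup and GC; this only shows the shape of the workload.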
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Single-threaded, event-based architectures:
parallelize by running multiple copies of
single-threaded code
Single-threaded, event-based architectures
• Threads work on their own private
data (as much as possible)

• Communicate with other threads
using events/messages
SignalFx
(Diagram: a network-in thread receives data; processor thread(s) process it against their own local key/value data; a network-out thread writes batched data. The threads communicate through events.)
SignalFx
(Diagram: the same pipeline, with the threads now connected by ring buffers; the processor thread still owns its local key/value data.)
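The slides show ring buffers connecting the threads but not their code. Below is a minimal single-producer/single-consumer ring buffer sketch in Java; production systems typically reach for the LMAX Disruptor or JCTools rather than rolling their own.

import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer; capacity must be a power of two.
final class SpscRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot the consumer reads
    private final AtomicLong tail = new AtomicLong(); // next slot the producer writes

    SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {                              // called only by the producer thread
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.set(t + 1);                              // volatile write publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {                                        // called only by the consumer thread
        long h = head.get();
        if (h == tail.get()) return null;             // empty
        int slot = (int) (h & mask);
        E e = (E) buffer[slot];
        buffer[slot] = null;                          // let the consumed element be GCed
        head.set(h + 1);
        return e;
    }
}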
Advantages of single-threaded, event-based architectures
• Enables many other optimizations, such as
• Compact array-based data structures
• Buffer/object re-use

• Loosely coupled - easy to test

• Run multiple copies for parallelism
SignalFx
(Diagram: a network-in thread feeds worker thread(s) through a ring buffer; each worker processes against its local key/value data and hands batched output to the network-out thread through another ring buffer.)
SignalFx
(Diagram: three worker threads, each receiving data with async IO, processing it synchronously, and writing results with async IO. Each worker owns its own shard of the key/value data: keys 1-4, 5-8, and 9-12 respectively.)
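A sketch of the worker shape shown above, with blocking queues standing in for the async network IO and ring buffers, and a boxed HashMap standing in for the worker's private shard (the flat layouts come later in the talk). The DataPoint and Rollup types are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Illustrative worker: owns its shard of state, talks to other threads only via queues.
final class ShardWorker implements Runnable {
    record DataPoint(long seriesId, double value) {}
    record Rollup(long seriesId, double sum, long count) {}

    private final BlockingQueue<DataPoint> inbound;   // filled by the network-in thread
    private final BlockingQueue<Rollup> outbound;     // drained by the network-out thread
    private final Map<Long, double[]> shard = new HashMap<>(); // thread-private: no locks

    ShardWorker(BlockingQueue<DataPoint> inbound, BlockingQueue<Rollup> outbound) {
        this.inbound = inbound;
        this.outbound = outbound;
    }

    @Override public void run() {
        List<DataPoint> batch = new ArrayList<>(1024);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.clear();
                batch.add(inbound.take());            // block for at least one point
                inbound.drainTo(batch, 1023);         // then grab whatever else is queued
                for (DataPoint p : batch) {           // process synchronously, no shared state
                    double[] s = shard.computeIfAbsent(p.seriesId(), k -> new double[2]);
                    s[0] += p.value();                // running sum
                    s[1] += 1;                        // count
                }
                flush();                              // batched hand-off to the writer thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void flush() {
        for (Map.Entry<Long, double[]> e : shard.entrySet()) {
            outbound.offer(new Rollup(e.getKey(), e.getValue()[0], (long) e.getValue()[1]));
        }
    }
}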
SignalFx
(Diagram: a variant with a dedicated async-IO thread: the network thread receives data, processor thread(s) work against their local key/value data, and the async-IO thread makes batched IO calls; the stages are connected by ring buffers.)
Advice for threaded applications
• Threads should ideally reflect the actual
parallelism of the system (see the sketch below)
• Avoid gratuitous oversubscription
• Exception: IO threads?
 
• DO NOT communicate unless you have to
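A small sketch of the sizing advice in the first bullet: create roughly one worker per available core instead of oversubscribing with many more threads than the hardware can run.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Size the worker pool to the machine's actual parallelism, not to the number of tasks.
final class WorkerPool {
    static ExecutorService create() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);   // one worker per core; IO threads extra
    }
}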
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Use compact, cache-conscious, array-based
data structures with minimal indirection
SignalFx
(Diagram: the cache hierarchy again: per-core L1 data, L1 instruction, and L2 caches; shared L3; main memory.)
Basic principles
• Strive for smaller data structures
• Extra computation is OK
• e.g. compressing network data

• Design data structures that facilitate
processing multiple entries—big
arrays!

• Layout should reflect access patterns
Hash maps
• Hash map lookups are NOT free!
 
• A lookup in a well-implemented hash
map is by definition a cache miss

• Popular implementations like
java.util.HashMap can cause multiple
cache misses
(Diagram builds: a typical hash map, e.g. java.util.HashMap, implemented as an array of buckets, each holding a list of key*/value* pointer pairs. One lookup can rack up four or more cache misses as it touches the bucket array, walks list nodes, and dereferences the key and value pointers.)
(Diagram builds: an open-addressed array of co-located key/value pairs, Key 0/Value 0 through Key 7/Value 7. A lookup with no collision costs a single cache miss; probing past a collision costs a second.)
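A minimal sketch (not the SignalFx implementation) of the co-located idea: keys and values interleaved in one long[] with linear probing, so a collision-free hit touches a single cache line and probing tends to stay on nearby lines.

// Open-addressed long -> long map with keys and values interleaved in one array.
// Simplifications: capacity is a power of two, 0 is reserved as the "empty" key,
// and the table never resizes, so callers must stay well under capacity.
final class ColocatedLongMap {
    private final long[] slots;   // [key0, value0, key1, value1, ...]
    private final int mask;

    ColocatedLongMap(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo * 2];
        mask = capacityPowerOfTwo - 1;
    }

    void put(long key, long value) {
        int i = (int) (mix(key) & mask);
        while (true) {
            long k = slots[i * 2];
            if (k == 0 || k == key) {            // empty slot or existing key
                slots[i * 2] = key;
                slots[i * 2 + 1] = value;        // value sits right next to its key
                return;
            }
            i = (i + 1) & mask;                  // linear probe: adjacent memory
        }
    }

    long getOrDefault(long key, long def) {
        int i = (int) (mix(key) & mask);
        while (true) {
            long k = slots[i * 2];
            if (k == key) return slots[i * 2 + 1];
            if (k == 0) return def;              // hit an empty slot: key absent
            i = (i + 1) & mask;
        }
    }

    private static long mix(long x) {            // cheap bit mixer to spread hash values
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        return x;
    }
}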
(Diagram builds: a hash map from key to an index into a separate array of structs, e.g. Key 0 -> index 1, Key 1 -> index 6, Key 2 -> index 4, Key 3 -> index 8. The key-to-index map stays small and compact, so even a lookup that hits a collision costs only a couple of cache misses before the indexed value is reached.)
New library memory layout (diagram): an ID-to-index map (ID 0 -> 1, ID 1 -> 6, ID 2 -> 4, ID 3 -> 8) pointing into one flat array of TimeSeries rollups (rollup 0 through rollup 8); raw data in, rollups out.
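A sketch of that layout: a small ID-to-index map in front of flat, contiguous arrays holding the hot rollup state. The field names are illustrative, and a real implementation would use a primitive-specialized map (next slide) instead of the boxed HashMap used here for brevity.

import java.util.HashMap;
import java.util.Map;

// Illustrative "ID -> index into flat arrays of rollup state".
final class RollupStore {
    private final Map<Long, Integer> idToIndex = new HashMap<>(); // small, index-only map
    private final double[] sums;    // hot rollup state, stored contiguously
    private final long[] counts;
    private int next = 0;

    RollupStore(int maxSeries) {
        sums = new double[maxSeries];
        counts = new long[maxSeries];
    }

    void record(long seriesId, double value) {
        int idx = idToIndex.computeIfAbsent(seriesId, k -> next++); // one map lookup per point
        sums[idx] += value;          // updates land in dense arrays, not scattered objects
        counts[idx] += 1;
    }

    double average(long seriesId) {
        Integer idx = idToIndex.get(seriesId);
        return idx == null ? Double.NaN : sums[idx] / counts[idx];
    }
}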
Changing hash map implementations
• java.util.HashMap (separate chaining, boxed
primitives) used for a long -> int lookup
• Allocations galore

• net.openhft.koloboke primitive open hash map
• 45% improvement

For the JVM, use libraries like https://github.com/OpenHFT/Koloboke.
For C++, try https://github.com/preshing/CompareIntegerMaps or similar.
Access patterns (diagrams): in the original layout, each of objects 0-3 interleaves its hot fields (Field 0-2) with its cold fields (Field 3-4), so every hot access drags cold data into cache. Grouping the fields that are accessed together, hot fields in one block and cold fields in another, keeps the hot path compact.
Results of separating hot and cold data
A hot loop that runs about once every 500 ms (a sketch of the split follows these results)
• Old - Hot and cold data kept together
• 5 cache lines per time series
• Took anywhere between 62-70 ms

• New - Hot and cold data kept separate
• 3 cache lines of hot data per time series
• Took anywhere between 40-45 ms

• 35% improvement
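A sketch of the hot/cold split in Java, structure-of-arrays style: the fields touched on every data point live in dense parallel arrays, and the rarely touched metadata lives in a separate array of objects. The field names are illustrative, not SignalFx's.

import java.util.Arrays;

// Illustrative hot/cold separation for per-time-series state.
final class TimeSeriesTable {
    // Hot fields: read and written for every incoming data point (kept dense).
    private final double[] sum;
    private final double[] max;
    private final long[] count;

    // Cold fields: touched rarely (creation time, dimension metadata, ...).
    private final ColdInfo[] cold;

    static final class ColdInfo {
        final long createdAtMillis;
        final String dimensions;
        ColdInfo(long createdAtMillis, String dimensions) {
            this.createdAtMillis = createdAtMillis;
            this.dimensions = dimensions;
        }
    }

    TimeSeriesTable(int capacity) {
        sum = new double[capacity];
        max = new double[capacity];
        count = new long[capacity];
        cold = new ColdInfo[capacity];
        Arrays.fill(max, Double.NEGATIVE_INFINITY);  // so the first recorded value always wins
    }

    void record(int index, double value) {           // the hot loop touches only hot arrays
        sum[index] += value;
        if (value > max[index]) max[index] = value;
        count[index] += 1;
    }

    ColdInfo metadata(int index) {                   // cold data stays off the hot path
        return cold[index];
    }
}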
SignalFx
Results (library)!
Old vs New
• Concurrent -> single threaded
• Locks gone
• Array-based data structures
• Zero allocations
• Extensive batching and hardware prefetching



• Multiple hash maps -> a single hash map lookup
Old vs New
76 K/sec vs 2.1 M/sec
Old vs New
27x
Old vs New
35 %
SignalFx
Results (application)!
Amdahl's law at a 35% library share (chart): overall speedup on the y-axis (0 to 1.6x) against library speedup on the x-axis (1, 4, 8, ... 28, ∞).
(CPU utilization charts follow.)
3.4x overall improvement
35% of the profile but 3.4x improvement?
• Amdahl's law
• At most a 1.54x improvement even if that 35% went to 0% (worked out below)

• Why 3.4x ?
• When you use less cache, you leave more for
others - thus speeding up other code too

• Lesson
• A profiler is a necessary tool, but not a substitute for
informed design
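Worked out: overall speedup S = 1 / ((1 - p) + p / s), where p is the fraction of the profile being sped up and s is its speedup. With p = 0.35 and s -> infinity, S = 1 / 0.65, which is about 1.54x: even an infinite library speedup caps the application at 1.54x if nothing else changes.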
Heap growth (chart)
Closing remarks / rant
• “Write code first, profile later” = BAD

• Excessive encapsulation leads to myopic
decisions about performance, e.g.:
• allocations
• “thread safe” code

• Beware of micro benchmarks
SignalFx
Thank You!
Rajiv Kurian
rajiv@signalfx.com
@rzidane360
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com
SignalFx
Bonus slides
Composition in C (diagram): Struct B embeds Struct B1 and Struct B2 directly, so their int fields are laid out inline; all of a struct's data, like Struct A's own ints, sits in one contiguous block of memory.
Composition in Java (diagram): Object B refers to Object B1 and Object B2 through references rather than embedding them; the int fields live inside the separately allocated B1 and B2.
Actual layout? (diagram builds): on the JVM, Object B is laid out as a B header plus two references, B1* and B2*; each reference points to a separately allocated object with its own header and two int fields (B1 header + int + int, B2 header + int + int). That is three allocations, three headers, and two pointer indirections for what C stores in one contiguous block. Potential layout after GC: the three objects may end up separated by other data on the heap.
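A sketch of what these bonus slides describe. The composed version pays for two extra objects, two extra headers, and two pointer hops; manually flattening the fields into the outer class (the closest today's Java gets to C's embedded structs) keeps everything in one allocation. The Point names are illustrative.

// Composed: the outer object holds references to separately allocated parts.
final class PointPair {
    static final class Point { int x; int y; }
    final Point first = new Point();    // separate allocation, own header, pointer hop
    final Point second = new Point();   // may land far from 'first' on the heap
}

// Flattened: the same data inlined into one object, one allocation, no indirection.
final class FlatPointPair {
    int firstX, firstY;
    int secondX, secondY;
}

(Project Valhalla's value classes aim to give the JVM flattened layouts without this manual rewrite.)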
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Separate the control and data planes
A networking concept (diagram): the data plane, packets in -> routing-table lookup -> packets out, is the frequent path; the control plane, which updates the routing data (a key/value table), is the infrequent path.
What the control and data planes do
In networking terminology:
• Data plane: the part that decides what to do
with packets arriving on an inbound interface
(frequent)
• Control plane: the part concerned with drawing
the network map or routing table (infrequent)
The goal of control and data plane separation
DO NOT slow the frequent path because
of the infrequent path
Runtime configuration variables (diagram): a setter thread writes a block of configuration variables (Flag 0-3, volatile/atomic); the worker thread reads them on every data point:
while (1) {
process_data_using_configuration_variables();
}
Runtime configuration variables, improved (diagram): the worker thread copies the volatile/atomic flags into its own cached configuration variables once per loop iteration, then processes a large batch:
while (1) {
cache_configuration_variables();
process_a_ton_of_stuff();
}
Volatile/atomic flag vs cached local flag
• Before: every run-time flag (consulted on every data
point) was a volatile/atomic load
• After: all run-time flags are cached locally and
refreshed on each run loop
• About 8% improvement in datapoints per second; others
might see more or less (see the sketch below)
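A Java sketch of the pattern in the last two slides: the flags are volatile so a setter thread can change them at runtime, but the worker copies them into locals once per loop and the hot batch reads only the cheap local copies. The flag names are illustrative.

// Illustrative control-plane/data-plane split for runtime flags.
final class ConfiguredWorker implements Runnable {
    // Control plane: written rarely, by a setter thread.
    volatile boolean emitDebugMetrics = false;
    volatile int rollupWindowSeconds = 1;

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Refresh the cached copies once per loop iteration (two volatile loads).
            boolean debug = emitDebugMetrics;
            int window = rollupWindowSeconds;

            // Data plane: process a large batch using only the plain local copies.
            for (int i = 0; i < 100_000; i++) {
                processOnePoint(debug, window);
            }
        }
    }

    private void processOnePoint(boolean debug, int window) {
        // per-data-point work would go here
    }
}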
