SignalFx
Scaling ingest pipelines with
high performance computing principles
Rajiv Kurian, Software Engineer
rajiv@signalfx.com
Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern
hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
SignalFx
Why we need to scale ingest
SignalFx is an advanced monitoring platform for modern applications
• High resolution:
• Up to 1 sec
• Streaming analytics:
• Charts/analytics update @ 1 sec
• Real time
• Multidimensional metrics:
• Dimensions: representing customer, server, etc.
• Filter, aggregate: e.g. 99th-pct-latency by service, customer
Ingest pipeline (diagram): raw time series data in; REST/rate-control, rollups, and persistence stages; processed data out to analytics.
SignalFx ingest library (diagram): raw data in, rollup data out; one rollup per time series (TimeSeries 0 through TimeSeries 8).
Issues identified (before applying HPC techniques)
• Expensive - too many servers

• Exhibits parallel slowdown
• More threads = worse performance

• What did the profile say?
• Death by a thousand cuts
• The core library = 35% of profile
SignalFx
Basic properties and limitations
of modern hardware
SignalFx
(Diagram: the cache hierarchy. Each core has its own L1 data cache, L1 instruction cache, and L2 cache; the L3 cache is shared between Core 1 and Core 2; main memory sits below.)
Cache Lines
• Data is transferred between memory and cache in blocks of
fixed size, called cache lines. Usually 64 bytes
• When the processor needs to read or write a location in
main memory, it first checks for a corresponding entry in the
cache. In the case of:
• a cache hit, the processor immediately reads or writes
the data in the cache line
• a cache miss, the cache allocates a new entry and
copies in data from main memory, then the request (read
or write) is fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us:
• Temporal locality
• Spatial locality
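To make spatial locality concrete, here is a small Java sketch (not from the slides): summing a contiguous long[] pulls in about eight values per 64-byte cache line, while walking a linked list can miss the cache on every node.

// Sketch: contiguous data exploits spatial locality; pointer chasing does not.
final class LocalityDemo {
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) {      // sequential access: ~8 longs per 64-byte cache line
            sum += v;
        }
        return sum;
    }

    static final class Node {
        long value;
        Node next;                   // each node may live on a different cache line
    }

    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) {  // potential cache miss per node
            sum += n.value;
        }
        return sum;
    }
}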
Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 Cache 0.5 ns
Branch mispredict 5 ns
L2 Cache 7 ns 14x L1 Cache
Mutex lock/unlock 25 ns
Main memory 100 ns 20x L2 Cache, 200x L1 Cache
Compress 1K bytes (Zippy) 3,000 ns
Send 1K bytes over 1Gbps 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms
Read 1MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same DC 500,000 ns 0.5 ms
Read 1MB sequentially from SSD 1,000,000 ns 1 ms 4x memory
Disk seek 10,000,000 ns 10 ms 20x DC roundtrip
Read 1MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
(Diagram slides: the L1 cache sits closest to the core, the L2 cache farther away, and main memory farthest of all.)
Our optimization goal
Convert a memory-bandwidth-bound
application into a CPU-bound application
Things we kept in mind
• Measure, measure, measure!

• Don’t rely on micro benchmarks alone
SignalFx
Benchmark
SignalFx library benchmark (diagram): a table of roughly one million entries mapping IDs (ID 0 through ID 1M) to TimeSeries rollups. Raw data comes in, in random order, one point per time series, and the whole set is replayed 50x; rollup data comes out. This path accounted for 35% of the profile of the entire application.
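The slides do not show the benchmark code; the sketch below is a rough, illustrative harness for the setup described (one million series IDs, points fed in random order, 50 passes), with a plain double[] standing in for per-series rollup state.

import java.util.Random;

// Illustrative harness for "1M time series, random order, 50 passes".
final class IngestBenchmark {
    static final int NUM_SERIES = 1_000_000;
    static final int PASSES = 50;

    public static void main(String[] args) {
        int[] ids = new int[NUM_SERIES];
        for (int i = 0; i < NUM_SERIES; i++) ids[i] = i;

        Random rnd = new Random(42);                 // Fisher-Yates shuffle: random arrival order
        for (int i = NUM_SERIES - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp;
        }

        double[] rollups = new double[NUM_SERIES];   // stand-in for per-series rollup state
        Random values = new Random(7);

        long start = System.nanoTime();
        for (int pass = 0; pass < PASSES; pass++) {
            for (int id : ids) {
                rollups[id] += values.nextDouble();  // one raw point per time series
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(PASSES * (long) NUM_SERIES + " points in " + elapsedMs + " ms");
    }
}

A real measurement would use a harness like JMH to control for JIT warmup and GC; this only shows the shape of the workload.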
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Single-threaded, event-based architectures:
parallelize by running multiple copies of
single-threaded code
Single-threaded, event-based architectures
• Threads work on their own private
data (as much as possible)

• Communicate with other threads
using events/messages
SignalFx
(Diagram: a network-in thread receives data; processor thread(s) process it against their own local key/value data; a network-out thread writes batched data. The threads communicate through events.)
SignalFx
(Diagram: the same pipeline, with the threads now connected by ring buffers; the processor thread still owns its local key/value data.)
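The slides show ring buffers connecting the threads but not their code. Below is a minimal single-producer/single-consumer ring buffer sketch in Java; production systems typically reach for the LMAX Disruptor or JCTools rather than rolling their own.

import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer; capacity must be a power of two.
final class SpscRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot the consumer reads
    private final AtomicLong tail = new AtomicLong(); // next slot the producer writes

    SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {                              // called only by the producer thread
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.set(t + 1);                              // volatile write publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {                                        // called only by the consumer thread
        long h = head.get();
        if (h == tail.get()) return null;             // empty
        int slot = (int) (h & mask);
        E e = (E) buffer[slot];
        buffer[slot] = null;                          // let the consumed element be GCed
        head.set(h + 1);
        return e;
    }
}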
Advantages of single-threaded, event-based architectures
• Enables many other optimizations, such as
• Compact array-based data structures
• Buffer/object re-use

• Loosely coupled - easy to test

• Run multiple copies for parallelism
SignalFx
(Diagram: a network-in thread feeds worker thread(s) through a ring buffer; each worker processes against its local key/value data and hands batched output to the network-out thread through another ring buffer.)
SignalFx
(Diagram: three worker threads, each receiving data with async IO, processing it synchronously, and writing results with async IO. Each worker owns its own shard of the key/value data: keys 1-4, 5-8, and 9-12 respectively.)
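A sketch of the worker shape shown above, with blocking queues standing in for the async network IO and ring buffers, and a boxed HashMap standing in for the worker's private shard (the flat layouts come later in the talk). The DataPoint and Rollup types are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Illustrative worker: owns its shard of state, talks to other threads only via queues.
final class ShardWorker implements Runnable {
    record DataPoint(long seriesId, double value) {}
    record Rollup(long seriesId, double sum, long count) {}

    private final BlockingQueue<DataPoint> inbound;   // filled by the network-in thread
    private final BlockingQueue<Rollup> outbound;     // drained by the network-out thread
    private final Map<Long, double[]> shard = new HashMap<>(); // thread-private: no locks

    ShardWorker(BlockingQueue<DataPoint> inbound, BlockingQueue<Rollup> outbound) {
        this.inbound = inbound;
        this.outbound = outbound;
    }

    @Override public void run() {
        List<DataPoint> batch = new ArrayList<>(1024);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.clear();
                batch.add(inbound.take());            // block for at least one point
                inbound.drainTo(batch, 1023);         // then grab whatever else is queued
                for (DataPoint p : batch) {           // process synchronously, no shared state
                    double[] s = shard.computeIfAbsent(p.seriesId(), k -> new double[2]);
                    s[0] += p.value();                // running sum
                    s[1] += 1;                        // count
                }
                flush();                              // batched hand-off to the writer thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void flush() {
        for (Map.Entry<Long, double[]> e : shard.entrySet()) {
            outbound.offer(new Rollup(e.getKey(), e.getValue()[0], (long) e.getValue()[1]));
        }
    }
}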
SignalFx
(Diagram: a variant with a dedicated async-IO thread: the network thread receives data, processor thread(s) work against their local key/value data, and the async-IO thread makes batched IO calls; the stages are connected by ring buffers.)
Advice for threaded applications
• Threads should ideally reflect the actual
parallelism of the system (see the sketch below)
• Avoid gratuitous oversubscription
• Exception: IO threads?
 
• DO NOT communicate unless you have to
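A small sketch of the sizing advice in the first bullet: create roughly one worker per available core instead of oversubscribing with many more threads than the hardware can run.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Size the worker pool to the machine's actual parallelism, not to the number of tasks.
final class WorkerPool {
    static ExecutorService create() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);   // one worker per core; IO threads extra
    }
}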
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Use compact, cache-conscious, array-based
data structures with minimal indirection
SignalFx
(Diagram: the cache hierarchy again: per-core L1 data, L1 instruction, and L2 caches; shared L3; main memory.)
Basic principles
• Strive for smaller data structures
• Extra computation is OK
• e.g. compressing network data

• Design data structures that facilitate
processing multiple entries—big
arrays!

• Layout should reflect access patterns
Hash maps
• Hash map lookups are NOT free!
 
• A lookup in a well-implemented hash
map is by definition a cache miss

• Popular implementations like
java.util.HashMap can cause multiple
cache misses
(Diagram builds: a typical hash map, e.g. java.util.HashMap, implemented as an array of buckets, each holding a list of key*/value* pointer pairs. One lookup can rack up four or more cache misses as it touches the bucket array, walks list nodes, and dereferences the key and value pointers.)
(Diagram builds: an open-addressed array of co-located key/value pairs, Key 0/Value 0 through Key 7/Value 7. A lookup with no collision costs a single cache miss; probing past a collision costs a second.)
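A minimal sketch (not the SignalFx implementation) of the co-located idea: keys and values interleaved in one long[] with linear probing, so a collision-free hit touches a single cache line and probing tends to stay on nearby lines.

// Open-addressed long -> long map with keys and values interleaved in one array.
// Simplifications: capacity is a power of two, 0 is reserved as the "empty" key,
// and the table never resizes, so callers must stay well under capacity.
final class ColocatedLongMap {
    private final long[] slots;   // [key0, value0, key1, value1, ...]
    private final int mask;

    ColocatedLongMap(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo * 2];
        mask = capacityPowerOfTwo - 1;
    }

    void put(long key, long value) {
        int i = (int) (mix(key) & mask);
        while (true) {
            long k = slots[i * 2];
            if (k == 0 || k == key) {            // empty slot or existing key
                slots[i * 2] = key;
                slots[i * 2 + 1] = value;        // value sits right next to its key
                return;
            }
            i = (i + 1) & mask;                  // linear probe: adjacent memory
        }
    }

    long getOrDefault(long key, long def) {
        int i = (int) (mix(key) & mask);
        while (true) {
            long k = slots[i * 2];
            if (k == key) return slots[i * 2 + 1];
            if (k == 0) return def;              // hit an empty slot: key absent
            i = (i + 1) & mask;
        }
    }

    private static long mix(long x) {            // cheap bit mixer to spread hash values
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        return x;
    }
}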
(Diagram builds: a hash map from key to an index into a separate array of structs, e.g. Key 0 -> index 1, Key 1 -> index 6, Key 2 -> index 4, Key 3 -> index 8. The key-to-index map stays small and compact, so even a lookup that hits a collision costs only a couple of cache misses before the indexed value is reached.)
New library memory layout (diagram): an ID-to-index map (ID 0 -> 1, ID 1 -> 6, ID 2 -> 4, ID 3 -> 8) pointing into one flat array of TimeSeries rollups (rollup 0 through rollup 8); raw data in, rollups out.
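A sketch of that layout: a small ID-to-index map in front of flat, contiguous arrays holding the hot rollup state. The field names are illustrative, and a real implementation would use a primitive-specialized map (next slide) instead of the boxed HashMap used here for brevity.

import java.util.HashMap;
import java.util.Map;

// Illustrative "ID -> index into flat arrays of rollup state".
final class RollupStore {
    private final Map<Long, Integer> idToIndex = new HashMap<>(); // small, index-only map
    private final double[] sums;    // hot rollup state, stored contiguously
    private final long[] counts;
    private int next = 0;

    RollupStore(int maxSeries) {
        sums = new double[maxSeries];
        counts = new long[maxSeries];
    }

    void record(long seriesId, double value) {
        int idx = idToIndex.computeIfAbsent(seriesId, k -> next++); // one map lookup per point
        sums[idx] += value;          // updates land in dense arrays, not scattered objects
        counts[idx] += 1;
    }

    double average(long seriesId) {
        Integer idx = idToIndex.get(seriesId);
        return idx == null ? Double.NaN : sums[idx] / counts[idx];
    }
}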
Changing hash map implementations
• java.util.HashMap (separate chaining, boxed
primitives) used for a long -> int lookup
• Allocations galore

• net.openhft.koloboke primitive open hash map
• 45% improvement

For the JVM, use libraries like https://github.com/OpenHFT/Koloboke.
For C++, try https://github.com/preshing/CompareIntegerMaps or similar.
Access patterns (diagrams): in the original layout, each of objects 0-3 interleaves its hot fields (Field 0-2) with its cold fields (Field 3-4), so every hot access drags cold data into cache. Grouping the fields that are accessed together, hot fields in one block and cold fields in another, keeps the hot path compact.
Results of separating hot and cold data
A hot loop that runs about once every 500 ms (a sketch of the split follows these results)
• Old - Hot and cold data kept together
• 5 cache lines per time series
• Took anywhere between 62-70 ms

• New - Hot and cold data kept separate
• 3 cache lines of hot data per time series
• Took anywhere between 40-45 ms

• 35% improvement
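A sketch of the hot/cold split in Java, structure-of-arrays style: the fields touched on every data point live in dense parallel arrays, and the rarely touched metadata lives in a separate array of objects. The field names are illustrative, not SignalFx's.

import java.util.Arrays;

// Illustrative hot/cold separation for per-time-series state.
final class TimeSeriesTable {
    // Hot fields: read and written for every incoming data point (kept dense).
    private final double[] sum;
    private final double[] max;
    private final long[] count;

    // Cold fields: touched rarely (creation time, dimension metadata, ...).
    private final ColdInfo[] cold;

    static final class ColdInfo {
        final long createdAtMillis;
        final String dimensions;
        ColdInfo(long createdAtMillis, String dimensions) {
            this.createdAtMillis = createdAtMillis;
            this.dimensions = dimensions;
        }
    }

    TimeSeriesTable(int capacity) {
        sum = new double[capacity];
        max = new double[capacity];
        count = new long[capacity];
        cold = new ColdInfo[capacity];
        Arrays.fill(max, Double.NEGATIVE_INFINITY);  // so the first recorded value always wins
    }

    void record(int index, double value) {           // the hot loop touches only hot arrays
        sum[index] += value;
        if (value > max[index]) max[index] = value;
        count[index] += 1;
    }

    ColdInfo metadata(int index) {                   // cold data stays off the hot path
        return cold[index];
    }
}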
SignalFx
Results (library)!
Old vs New
• Concurrent -> single threaded
• Locks gone
• Array-based data structures
• Zero allocations
• Extensive batching and hardware prefetching



• Multiple hash maps -> a single hash map lookup
Old vs New
76 K/sec vs 2.1 M/sec
Old vs New
27x
Old vs New
35 %
SignalFx
Results (application)!
Amdahl's law at a 35% library share (chart): overall speedup on the y-axis (0 to 1.6x) against library speedup on the x-axis (1, 4, 8, ... 28, ∞).
(CPU utilization charts follow.)
3.4x overall improvement
35% of the profile but 3.4x improvement?
• Amdahl's law
• At most a 1.54x improvement even if that 35% went to 0% (worked out below)

• Why 3.4x ?
• When you use less cache, you leave more for
others - thus speeding up other code too

• Lesson
• A profiler is a necessary tool, but not a substitute for
informed design
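Worked out: overall speedup S = 1 / ((1 - p) + p / s), where p is the fraction of the profile being sped up and s is its speedup. With p = 0.35 and s -> infinity, S = 1 / 0.65, which is about 1.54x: even an infinite library speedup caps the application at 1.54x if nothing else changes.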
Heap growth (chart)
Closing remarks / rant
• “Write code first, profile later” = BAD

• Excessive encapsulation leads to myopic
decisions about performance, e.g.:
• allocations
• “thread safe” code

• Beware of micro benchmarks
SignalFx
Thank You!
Rajiv Kurian
rajiv@signalfx.com
@rzidane360
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com
SignalFx
Bonus slides
Composition in C (diagram): Struct B embeds Struct B1 and Struct B2 directly, so their int fields are laid out inline; all of a struct's data, like Struct A's own ints, sits in one contiguous block of memory.
Composition in Java (diagram): Object B refers to Object B1 and Object B2 through references rather than embedding them; the int fields live inside the separately allocated B1 and B2.
Actual layout? (diagram builds): on the JVM, Object B is laid out as a B header plus two references, B1* and B2*; each reference points to a separately allocated object with its own header and two int fields (B1 header + int + int, B2 header + int + int). That is three allocations, three headers, and two pointer indirections for what C stores in one contiguous block. Potential layout after GC: the three objects may end up separated by other data on the heap.
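A sketch of what these bonus slides describe. The composed version pays for two extra objects, two extra headers, and two pointer hops; manually flattening the fields into the outer class (the closest today's Java gets to C's embedded structs) keeps everything in one allocation. The Point names are illustrative.

// Composed: the outer object holds references to separately allocated parts.
final class PointPair {
    static final class Point { int x; int y; }
    final Point first = new Point();    // separate allocation, own header, pointer hop
    final Point second = new Point();   // may land far from 'first' on the heap
}

// Flattened: the same data inlined into one object, one allocation, no indirection.
final class FlatPointPair {
    int firstX, firstY;
    int secondX, secondY;
}

(Project Valhalla's value classes aim to give the JVM flattened layouts without this manual rewrite.)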
SignalFx
Techniques inspired by HPC that have
improved our pipeline
Separate the control and data planes
A networking concept (diagram): the data plane, packets in -> routing-table lookup -> packets out, is the frequent path; the control plane, which updates the routing data (a key/value table), is the infrequent path.
What the control and data planes do
In networking terminology:
• Data plane: the part that decides what to do
with packets arriving on an inbound interface
(frequent)
• Control plane: the part concerned with drawing
the network map or routing table (infrequent)
The goal of control and data plane separation
DO NOT slow the frequent path because
of the infrequent path
Runtime configuration variables (diagram): a setter thread writes a block of configuration variables (Flag 0-3, volatile/atomic); the worker thread reads them on every data point:
while (1) {
process_data_using_configuration_variables();
}
Runtime configuration variables, improved (diagram): the worker thread copies the volatile/atomic flags into its own cached configuration variables once per loop iteration, then processes a large batch:
while (1) {
cache_configuration_variables();
process_a_ton_of_stuff();
}
Volatile/atomic flag vs cached local flag
• Before: every run-time flag (consulted on every data
point) was a volatile/atomic load
• After: all run-time flags are cached locally and
refreshed on each run loop
• About 8% improvement in datapoints per second; others
might see more or less (see the sketch below)
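A Java sketch of the pattern in the last two slides: the flags are volatile so a setter thread can change them at runtime, but the worker copies them into locals once per loop and the hot batch reads only the cheap local copies. The flag names are illustrative.

// Illustrative control-plane/data-plane split for runtime flags.
final class ConfiguredWorker implements Runnable {
    // Control plane: written rarely, by a setter thread.
    volatile boolean emitDebugMetrics = false;
    volatile int rollupWindowSeconds = 1;

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Refresh the cached copies once per loop iteration (two volatile loads).
            boolean debug = emitDebugMetrics;
            int window = rollupWindowSeconds;

            // Data plane: process a large batch using only the plain local copies.
            for (int i = 0; i < 100_000; i++) {
                processOnePoint(debug, window);
            }
        }
    }

    private void processOnePoint(boolean debug, int window) {
        // per-data-point work would go here
    }
}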
