From node.js to Scala - with a 100x performance boost

FROM NODE.JS TO SCALA
with a 100x perf. boost!
BY ITAMAR RAVID | MAY 3, 2016

AGENDA
WE’LL TALK ABOUT…
• What we do, our challenges and what led us to Scala and Akka;
• How we redesigned our core data processing service;
• Some useful lessons and patterns.
There will be relatively little node.js bashing. Promise.

BIGPANDA: THE ANSWER TO ALERT FATIGUE
Raw alerts:
• RABBIT IS DOWN!
• NO FREE SPACE!
• INBOUND QUEUE OVERFLOWING!
• OUTBOUND QUEUE OVERFLOWING!
• APPLICATION HEALTH CRITICAL!
• TOO MANY FAILED HTTP REQS!
Normalized alerts (host, check):
• rabbit-1, ping
• rabbit-2, disk
• queue-1, size
• queue-2, size
• app1, health
• app2, 500 codes
Correlation algorithm output (incidents):
• RabbitMQ cluster: ping, disk
• RabbitMQ node 3: queue size, queue size
• API server: health, failed reqs

IN TERMS OF STREAMS…
Event sources: Nagios, Datadog, AppDynamics
• Raw alerts: RABBIT IS DOWN!, NO FREE SPACE!, INBOUND QUEUE OVERFLOWING!, OUTBOUND QUEUE OVERFLOWING!, APPLICATION HEALTH CRITICAL!, TOO MANY FAILED HTTP REQS!
• Normalization Stage output: rabbit-1, ping; rabbit-2, disk; queue-1, size; queue-2, size; app1, health; app2, 500 codes
• Correlation Stage output (correlation algorithm): RabbitMQ cluster (ping, disk); RabbitMQ node 3 (queue size, queue size); API server (health, failed reqs)

CHALLENGE 1
SCALING TO MEET CUSTOMER LOAD

HIGH-LEVEL ARCHITECTURE
• API servers (3 instances)
• Normalization (3 instances)
• Correlation (3 instances)
• RabbitMQ Exchange (3 exchanges)
• Mongo

USAGE OF RABBITMQ
RabbitMQ Cons. Hash exchange: route by hash on Customer
• Queue (Customers A, B, C) -> Correlation
• Queue (Customers D, E, F) -> Correlation
• Queue (Customers X, Y, Z) -> Correlation
DATA FOR A GIVEN CUSTOMER MUST BE PROCESSED SERIALLY, IN ORDER. SO…
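
A minimal sketch of routing by hash on the customer, so that every message for one customer lands on the same queue and is consumed in order. The queue names and the `queueFor` helper are illustrative assumptions, and plain modulo hashing stands in for the hash ring that a real consistent-hash exchange uses.

```scala
// Hypothetical sketch: pin each customer to exactly one queue.
object CustomerRouting {
  // Illustrative queue names, not taken from the talk.
  val queues: Vector[String] = Vector("queue-0", "queue-1", "queue-2")

  // Modulo hashing as a simplification of consistent hashing.
  // floorMod keeps the index non-negative even for negative hash codes.
  def queueFor(customer: String): String = {
    val index = Math.floorMod(customer.hashCode, queues.size)
    queues(index)
  }
}
```

Because the mapping is a pure function of the customer id, ordering per customer is preserved, but all customers that hash to the same queue also share that queue's fate.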

MEET REALITY!
Not fun!
A hiccup in a customer’s datacenter =>
An entire queue is blocked

CHALLENGE 2
CORRELATION PREVIEW

CORRELATION
MATCHING RULES
• Same host, 4 hours
• …
+
ALERTS
• rabbit-1, ping, t=5
• rabbit-1, disk, t=7
=
INCIDENT
• rabbit-1: ping, disk
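
A rough sketch of a "same host, within a time window" matching rule. The `Alert` and `Incident` types and the `correlate` function are assumptions made for illustration; the talk does not show the real algorithm, and the time unit is whatever the alert timestamps use.

```scala
// Hypothetical sketch of a "same host, within a window" matching rule.
object Correlation {
  final case class Alert(host: String, check: String, time: Long)
  final case class Incident(host: String, alerts: Vector[Alert])

  // An alert joins the most recent incident for its host if it arrives
  // within `window` of that incident's first alert; otherwise it opens a new one.
  def correlate(alerts: Seq[Alert], window: Long): Vector[Incident] =
    alerts.sortBy(_.time).foldLeft(Vector.empty[Incident]) { (incidents, alert) =>
      val idx = incidents.lastIndexWhere { i =>
        i.host == alert.host && alert.time - i.alerts.head.time <= window
      }
      if (idx >= 0) {
        val open = incidents(idx)
        incidents.updated(idx, open.copy(alerts = open.alerts :+ alert))
      } else incidents :+ Incident(alert.host, Vector(alert))
    }
}
```

Running the same alerts through `correlate` with a different `window` gives a different set of incidents, which is the question a rule change raises.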

CORRELATION
MATCHING RULES
• Same host, 4 hours, being changed to 30 minutes
+
ALERTS
• rabbit-1, ping, t=5
• rabbit-1, disk, t=7
=
INCIDENT
• rabbit-1: ping, disk

CORRELATION
MATCHING RULES
• Same host, 4 hours, being changed to 30 minutes
+
ALERTS
• rabbit-1, ping, t=5
• rabbit-1, disk, t=7
=
INCIDENT
• ?

A CORRELATION TIME-MACHINE
ALERTS: 1 2 3 4 5 6 7 8 9 10 … N
Markers on the stream: START FROM HERE (DC OUTAGE), WE’RE HERE
Correlation Servers: OFFSETS

THIS MEANS…
REPLAY
DETERMINISTIC
FAST

EXISTING CORRELATION SOLUTION
Components: Processing Stage (x3), Rabbit (x4), Mongo
PROCESSING STAGE - A NODE.JS CALLBACK.
• Shared mutable state
• No isolation
• No replay

DESIRED SOLUTION
Components: Processing Stage (x3), Rabbit (x2), Mongo

NODE.JS - PLATFORM LIMITATIONS
HEAP SIZE - LIMITED TO 1.7GB
SINGLE THREADED :-(
TypeError: undefined is not a function

COMPONENTS
DURABLE EVENT STREAM
PLATFORM
COMPUTING FRAMEWORK

ACTOR-BASED SOLUTION
Node Manager
• Customer A Pipeline: Kafka Reader (customer_a_inputs), Algorithm runner, Mongo Writer, Rabbit Writer
• Customer B Pipeline
• Customer C Pipeline
Highlighted: SUPERVISION, MESSAGING

NEXT-GEN SOLUTION
Node Manager
• Customer A Pipeline: Kafka Reader (customer_a_inputs), Algorithm runner, Mongo Writer, Rabbit Writer
• Customer B Pipeline
• Customer C Pipeline
Highlighted: SUPERVISION, MESSAGING, FAILURE ISOLATION, SEPARATE DISPATCHERS FOR QOS-TUNING

SCALING OUT
• Node 1: Cluster Manager, Node Manager
• Node 2: Node Manager
• Node 3: Node Manager

PRUNING AN INFINITE DATA STREAM
1 2 3 4 5 6 7 8 9 10 … N

1 2 3 4 5 6 7 8 9 10 … N
t=10, Critical; t=8, OK

5 6 7 8 9 10 … N
t=8, OK
MISSING ALERTS :-(
PRUNING STREAMS THAT RESULT IN STATE REQUIRES STATE RECOVERY.

5 6 7 8 9 10 … N
Snapshot Repository
• <data …>, lastOffset: 4
• <data …>, lastOffset: 8
• <data …>, lastOffset: 10
ON BOOT, THE LATEST SNAPSHOT IS LOADED
AND THE STREAM SEEKS TO THE STORED OFFSET.
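
A small sketch of the boot sequence described above: load the latest snapshot, then replay only the records after its stored offset. The `Snapshot` and `Record` types, the per-key counter used as state, and the assumption that offsets start at 1 are all illustrative; the real state is whatever the correlation algorithm keeps.

```scala
// Hypothetical sketch: recover state from the latest snapshot plus the stream tail.
object SnapshotRecovery {
  final case class Snapshot(state: Map[String, Int], lastOffset: Long)
  final case class Record(offset: Long, key: String)

  // Stand-in for the real state update: count records per key.
  def applyRecord(state: Map[String, Int], r: Record): Map[String, Int] =
    state.updated(r.key, state.getOrElse(r.key, 0) + 1)

  // Pick the snapshot with the highest offset, then replay only newer records.
  def recover(snapshots: Seq[Snapshot], stream: Seq[Record]): Snapshot = {
    val start =
      if (snapshots.isEmpty) Snapshot(Map.empty, 0L)
      else snapshots.maxBy(_.lastOffset)
    stream
      .filter(_.offset > start.lastOffset)
      .foldLeft(start) { (snap, r) => Snapshot(applyRecord(snap.state, r), r.offset) }
  }
}
```

As long as the snapshot's `lastOffset` is still present in the pruned stream or directly precedes it, recovery yields the same state as replaying the full stream from the start.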

CHALLENGES:
- COMPACTNESS
- SCHEMA EVOLUTION
kryo/chill with a manual de/serializer <=> Map[String, Any]
Schema evolution support with some caveats
Big datasets are only a few MBs in size

KEY TAKEAWAYS
• USE SNAPSHOTS TO PRUNE STREAMS
• JSON IS NOT THE ONLY SOLUTION!

FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
Stages: INPUTS -> Kafka reader -> Algorithm Runner -> Mongo Writer, Rabbit Writer

PIPELINING BETWEEN STAGES
MSG BATCHES in the pipeline, slide by slide:
• 1
• 2 1
• 3 2 1

RETRYING
• MSG BATCHES 3 2 1, RETRYING. Persistent failure will restart entire pipeline.
• MSG BATCHES 4 3 2 1, RETRYING.

KEY TAKEAWAYS
• CAPTURE COMMON ACTOR BEHAVIOR USING TRAITS (BUT MAKE SURE THEY COMPOSE!)
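
A sketch of capturing shared behavior in stackable traits, using a plain `Stage` handler as a stand-in for an actor (this is not Akka's API). `Retrying`, `Counting` and `FlakyWriter` are hypothetical names; the point is that the mix-in order decides which behavior wraps which.

```scala
// Hypothetical sketch: stackable traits around a message handler.
object StackableBehavior {
  trait Stage {
    def handle(msg: String): String
  }

  // Retries the wrapped handler up to maxAttempts times.
  trait Retrying extends Stage {
    def maxAttempts: Int = 3
    abstract override def handle(msg: String): String = {
      def attempt(n: Int): String =
        try super.handle(msg)
        catch { case _: RuntimeException if n < maxAttempts => attempt(n + 1) }
      attempt(1)
    }
  }

  // Counts how many times the wrapped handler is invoked.
  trait Counting extends Stage {
    var seen: Int = 0
    abstract override def handle(msg: String): String = {
      seen += 1
      super.handle(msg)
    }
  }

  // Fails on its first two calls, then succeeds.
  class FlakyWriter extends Stage {
    private var calls = 0
    def handle(msg: String): String = {
      calls += 1
      if (calls < 3) throw new RuntimeException("transient failure")
      s"wrote $msg"
    }
  }
}
```

With `new FlakyWriter with Retrying with Counting`, `Counting` is outermost and sees one message; with `new FlakyWriter with Counting with Retrying`, `Retrying` is outermost and `Counting` sees every attempt. Both compile, so the intended composition has to be checked deliberately.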

DEFERRING AND CONTROLLING STATE MUTATION
PREVIOUSLY: Processing Stage (x3), Mongo
HERE BE RACE CONDITIONS!

DEFERRING AND CONTROLLING STATE MUTATION
Algorithm runner -> Instructions -> Mongo Writer (AN INTERPRETER) -> Mongo

DEFERRING STATE MUTATION
Instructions: id1 id2 id1 id1 id2 id2 id1 id2 id1 id1 id2 id2
Mongo: get, set
OPTIMIZE ME!

FOLDING INSTRUCTIONS TO REDUCE I/O
id1 -> inst1 :: inst2 :: inst3 … :: Nil
id2 -> inst1 :: inst2 :: inst3 … :: Nil
Mongo
getMultiple setMultiple
foldLeft(initialObject)(processInstruction)
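
A sketch of the folding idea above: group instructions by object id, fetch all affected objects in one read, fold each id's instructions over its initial object, and write everything back in one call. The `Instruction` cases, the `Doc` type, and the `getMultiple` / `setMultiple` function parameters are assumptions standing in for the real instruction set and the batched Mongo calls.

```scala
// Hypothetical sketch: interpret instructions with one batched read and one batched write.
object InstructionFolding {
  sealed trait Instruction { def id: String }
  final case class SetField(id: String, field: String, value: String) extends Instruction
  final case class RemoveField(id: String, field: String) extends Instruction

  type Doc = Map[String, String]

  // Applies a single instruction to an in-memory object.
  def processInstruction(doc: Doc, inst: Instruction): Doc = inst match {
    case SetField(_, field, value) => doc.updated(field, value)
    case RemoveField(_, field)     => doc - field
  }

  def interpret(
      instructions: Seq[Instruction],
      getMultiple: Set[String] => Map[String, Doc],
      setMultiple: Map[String, Doc] => Unit
  ): Unit = {
    // groupBy keeps the original instruction order within each id.
    val byId = instructions.groupBy(_.id)
    val initial = getMultiple(byId.keySet) // one read for all ids
    val updated = byId.map { case (id, insts) =>
      val initialObject = initial.getOrElse(id, Map.empty[String, String])
      id -> insts.foldLeft(initialObject)(processInstruction)
    }
    setMultiple(updated) // one write for all ids
  }
}
```

However many instructions a batch carries, the store sees one read and one write, and the per-object logic stays a pure function that is easy to test.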

KEY TAKEAWAYS
• DECOUPLE STATE MUTATION FROM PROCESSING
• OPTIMIZE STATE MUTATION WHEN INTERPRETING

MEASURE!
Dropwizard Metrics + metrics-scala:

KEY TAKEAWAYS
INSTRUMENT AWAY!

FINAL NUMBERS AND BENEFITS
OVERALL RATE IMPROVEMENT:
• ~16 events/s on a single node.js process at peak
• 1600-2500 events/s on a single pipeline at peak
ISOLATION
• Actor-per-Customer; failure isolation
COMPLETE DETERMINISM
• Actions determined entirely by Kafka contents; amazing for debugging!
SCALABILITY
• More nodes => more actors; reduced I/O

WE’RE HIRING!
iravid@bigpanda.io

GROCERY LIST
RabbitMQ - op-rabbit
MongoDB - reactivemongo
Kafka - kafka-clients
Zookeeper - curator
Dependency Injection - scaldi
Logging - log4j2, scala-logging, raven-log4j2
Metrics - Dropwizard Metrics, metrics-scala
Config - Typesafe Config
JSON - play-json
Binary serde - kryo/chill
