Event storage and real-time analysis at
Booking.com with Riak	
Damien Krotkine
• Software Engineer at Booking.com
• github.com/dams
• @damsieboy
• dkrotkine
• 800,000 room nights reserved per day
WE ARE HIRING
INTRODUCTION
[Diagram: traffic from www, mobile, and API clients flows through frontends to backends.]
[Diagram: every subsystem (backend, web, mobile, API, databases, caches, load balancers, availability cluster, email, etc.) emits events, i.e. info about subsystem status, into the events storage.]
WHAT IS AN EVENT?
EVENT STRUCTURE
• Provides info about subsystems
• Data
• Deep HashMap
• Timestamp
• Type + Subtype
• The rest: specific data
• Schema-less
EXAMPLE 1: WEB APP EVENT
• Large event
• Info about user actions
• Requests, user type
• Timings
• Warnings, errors
• Etc…
{ timestamp => 12345,
  type      => 'WEB',
  subtype   => 'app',
  action    => { is_normal_user => 1,
                 pageview_id    => '188a362744c301c2',
                 # ...
               },
  tuning    => { the_request => 'GET /display/...',
                 bytes_body  => 35,
                 wallclock   => 111,
                 nr_warnings => 0,
                 # ...
               },
  # ...
}
EXAMPLE 2: AVAILABILITY CLUSTER EVENT
• Small event
• Cluster provides availability info
• Event: Info about request types and timings
{ type      => 'FAV',
  subtype   => 'fav',
  timestamp => 1401262979,
  dc        => 1,
  tuning    => {
    flatav => {
      cluster       => '205',
      sum_latencies => 21,
      role          => 'fav',
      num_queries   => 7,
    },
  },
}
EVENTS FLOW PROPERTIES
• Read-only
• Schema-less
• Continuous, ordered, timed
• 15K events per second
• 1.25 billion events per day
• Peak at 70 MB/s, min at 25 MB/s
• 100 GB per hour
SERIALIZATION
• JSON didn’t work for us (slow, big, lacks features)
• Created Sereal in 2012
• "Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization"
• Added Sereal encoder & decoder in Erlang in 2014
USAGE
ASSESS THE NEEDS
• Before thinking about storage
• Think about the usage
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
GRAPHS
• Graph in real time (a few seconds of lag)
• Graph as many systems as possible
• General platform health check
[Screenshots: graphs, dashboards, and meta graphs.]
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
DECISION MAKING
• Strategic decisions (use facts)
• Long term or short term
• Technical / non-technical reporting
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
SHORT TERM ANALYSIS
• From 10 seconds ago to 8 days ago
• Code deployment checks and rollback
• Anomaly Detector
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
A/B TESTING
• Our core philosophy: use facts
• It means: do A/B testing
• Concept of Experiments
• Events provide data to compare
EVENT AGGREGATION
EVENT AGGREGATION
• Group events
• Granularity we need: second
[Diagrams: events from the web, api, and dbs subsystems flow into LOGGER processes; each LOGGER groups the events it receives into 1-second batches, reserializes + compresses each batch, and pushes it to the events storage. Many LOGGERs run in parallel.]
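The per-second grouping step can be sketched as follows (simplified, with hypothetical event hashes; the real LOGGERs also reserialize and compress each batch):

```perl
# Group events into 1-second buckets keyed by their integer epoch
# timestamp, as a LOGGER does before pushing each batch to storage.
my @events = (
    { timestamp => 12345, type => 'WEB' },
    { timestamp => 12345, type => 'FAV' },
    { timestamp => 12346, type => 'WEB' },
);

my %by_second;
push @{ $by_second{ $_->{timestamp} } }, $_ for @events;

# %by_second now holds 2 events for epoch 12345 and 1 for 12346
```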
STORAGE
WHAT WE WANT
• Storage security
• Mass write performance
• Mass read performance
• Easy administration
• Very scalable
WE CHOSE RIAK
• Security: cluster, distributed, very robust
• Good and predictable read / write performance
• The easiest to setup and administrate
• Advanced features (MapReduce, triggers, 2i, CRDTs …)
• Riak Search
• Multi Datacenter Replication
CLUSTER
• Commodity hardware
• All nodes serve data
• Data replication
• Gossip between nodes
• No master
[Diagram: a ring of servers; hash(key) determines placement on the ring.]
KEY VALUE STORE
• Namespaces: buckets
• Values: opaque blobs or CRDTs
RIAK: ADVANCED FEATURES
• MapReduce
• Secondary indexes
• Riak Search
• Multi DataCenter Replication
MULTI-BACKEND
• Bitcask
• Eleveldb
• Memory
BACKEND: BITCASK
• Log-based storage backend
• Append-only files
• Advanced expiration
• Predictable performance
• Perfect for reading sequential data
CLUSTER CONFIGURATION
DISK SPACE NEEDED
• 8 days of retention
• 100 GB per hour
• Replication factor: 3
• 100 GB × 24 h × 8 days × 3 = 57.6 TB
• Need ~60 TB
HARDWARE
• 16 nodes
• 12 CPU cores (Xeon 2.5 GHz)
• 192 GB RAM
• Network: 1 Gbit/s
• 8 TB per node (RAID 6)
• Cluster space: 128 TB
RIAK CONFIGURATION
• Vnodes: 256
• Replication: n_val = 3
• Expiration: 8 days
• 4 GB Bitcask data files
• Compaction only when a file is full
• Compaction runs only once a day
DISK SPACE RECLAIMED
[Graph: disk space reclaimed by the daily compaction, over one day.]
DATA DESIGN
[Diagram recap: loggers push 1-second batches into the events storage.]
1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE
500 KB max chunks
DATA
• Bucket name: "data"
• Key: "12345:1:cell0:WEB:app:chunk0"
• Value: serialized, compressed data
• About 120 keys per second
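The key scheme above can be sketched as a small helper (hypothetical, for illustration only):

```perl
# Build a data key following the EPOCH:DC:CELL:TYPE:SUBTYPE:chunkN scheme.
sub data_key {
    my ($epoch, $dc, $cell, $type, $subtype, $chunk) = @_;
    return join ':', $epoch, $dc, $cell, $type, $subtype, "chunk$chunk";
}

print data_key(12345, 1, 'cell0', 'WEB', 'app', 0), "\n";
# prints 12345:1:cell0:WEB:app:chunk0
```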
METADATA
• Bucket name: "metadata"
• Key: epoch-dc, e.g. "12345-2"
• Value: the list of data keys:
[ "12345:1:cell0:WEB:app:chunk0",
  "12345:1:cell0:WEB:app:chunk1",
  …
  "12345:4:cell0:EMK::chunk3" ]
• Stored as a pipe-separated value
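Building and parsing that pipe-separated value is a plain join/split; a minimal sketch (not the production logger code):

```perl
# The metadata value is just the data keys joined with '|'.
my @data_keys = (
    '12345:1:cell0:WEB:app:chunk0',
    '12345:1:cell0:WEB:app:chunk1',
);
my $metadata_value = join '|', @data_keys;

# Readers split it back; note the pattern must escape the pipe,
# since a bare '|' would be treated as regex alternation.
my @parsed = split /\|/, $metadata_value;
```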
PUSH DATA IN
PUSH DATA IN
• In each DC, in each cell, Loggers push to Riak
• 2 protocols: REST or ProtoBuf
• Every second:
• Push data values to Riak, async
• Wait for success
• Push metadata
JAVA
Bucket DataBucket = riakClient.fetchBucket("data").execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
MetaDataBucket.store("12345-1", metaData).execute();
riakClient.shutdown();
Perl
my $client = Riak::Client->new(…);
$client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
$client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
$client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
$client->put(metadata => '12345-1', $metadata, 'text/plain' );
GET DATA OUT
GET DATA OUT
• Request metadata for epoch-DC
• Parse value
• Filter out unwanted types / subtypes
• Fetch the data keys
Perl
my $client = Riak::Client->new(…);
my @array = split /\|/, $client->get(metadata => '12345-1');
my @filtered_array = grep { /WEB/ } @array;
$client->get(data => $_) foreach @filtered_array;
REAL TIME PROCESSING OUTSIDE OF RIAK
STREAMING
• Fetch 1 second every second
• Or a range ( last 10 min )
• Client generates all the epochs for the range
• Fetch all epochs from Riak
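Since keys are derived from epochs, generating the range client-side is a simple loop; a minimal sketch (the actual fetch call is elided):

```perl
# Enumerate every epoch second in the requested range, e.g. the
# last 10 minutes, then fetch each epoch's metadata + data from Riak.
my $now    = time();
my @epochs = ( $now - 600 .. $now - 1 );

# For each epoch, one would fetch the 'metadata' key per DC, then
# the data keys it lists, e.g.:
# $client->get(metadata => "$_-1") for @epochs;
```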
EXAMPLES
• Continuous fetch => Graphite ( every sec )
• Continuous fetch => Anomaly Detector ( last 2 min )
• Continuous fetch => Experiment analysis ( last day )
• Continuous fetch => Hadoop
• Manual request => test, debug, investigate
• Batch fetch => ad hoc analysis
• => Huge numbers of fetches
[Diagram: the events storage feeds the graphite cluster, the anomaly detector, the experiment cluster, the hadoop cluster, mysql analysis, and manual requests, at roughly 50 MB/s per consumer.]
REALTIME
• 1 second of data
• Stored in < 1 sec
• Available after < 1 sec
• Issue: network saturation
REAL TIME PROCESSING INSIDE RIAK
THE IDEA
• Instead of:
• fetching data out, crunching it, keeping a small result
• Do:
• bring the code to the data
WHAT TAKES TIME
• Takes a lot of time
• Fetching data out
• Decompressing
• Takes almost no time
• Crunching data
MAPREDUCE
• Send code to be executed
• Works fine for 1 job
• Takes < 1s to process 1s of data
• Doesn’t work for multiple jobs
• Has to be written in Erlang
HOOKS
• Every time metadata is written
• Post-Commit hook triggered
• Crunch data on the nodes
[Diagram: on each node host, the Riak service and a REST service run side by side; when new data is sent for storage, the post-commit hook passes the keys over a socket to the local REST service, which fetches, decompresses, and processes all tasks.]
HOOK CODE
metadata_stored_hook(RiakObject) ->
    %% the key is "<epoch>-<dc>"; the value is the pipe-separated
    %% list of data keys for that second
    Key = riak_object:key(RiakObject),
    [ Epoch, _DC ] = binary:split(Key, <<"-">>),
    Data = riak_object:get_value(RiakObject),
    DataKeys = binary:split(Data, <<"|">>, [ global ]),
    %% target the REST companion running on this same node
    {ok, HostnameStr} = inet:gethostname(),
    send_to_REST(Epoch, list_to_binary(HostnameStr), DataKeys),
    ok.

send_to_REST(Epoch, Hostname, DataKeys) ->
    Method = post,
    URL = "http://" ++ binary_to_list(Hostname)
          ++ ":5000?epoch=" ++ binary_to_list(Epoch),
    HTTPOptions = [ { timeout, 4000 } ],
    %% fire-and-forget: async request, reply is discarded
    Options = [ { body_format, string },
                { sync, false },
                { receiver, fun(_ReplyInfo) -> ok end } ],
    Body = iolist_to_binary(mochijson2:encode(DataKeys)),
    httpc:request(Method, {URL, [], "application/json", Body},
                  HTTPOptions, Options),
    ok.
REST SERVICE
• In Perl, using PSGI, Starman, preforked workers
• Allows us to write the data crunchers in Perl
• Also supports loading code on demand
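A hedged sketch of such a companion service: a minimal PSGI app that accepts the hook's POST (epoch in the query string, JSON array of data keys in the body). The crunching itself is elided and all names are hypothetical, not the production service:

```perl
use JSON::PP qw(decode_json);

# Minimal PSGI app: receive epoch + data keys posted by the
# post-commit hook, then (elided) fetch, decompress, and process.
my $app = sub {
    my ($env) = @_;
    my ($epoch) = ( $env->{QUERY_STRING} // '' ) =~ /epoch=(\d+)/;
    my $body = do { local $/; my $fh = $env->{'psgi.input'}; <$fh> };
    my $data_keys = decode_json($body);  # e.g. ["12345:1:cell0:WEB:app:chunk0"]
    # ... fetch each key from the local node and run all tasks here ...
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ "ok\n" ] ];
};
```

Under PSGI this sub is what Starman's preforked workers would invoke per request.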
ADVANTAGES
• CPU usage and execution time can be capped
• Data is local to processing
• Two systems are decoupled
• REST service written in any language
• Data processing done all at once
• Data is decompressed only once
DISADVANTAGES
• Only for incoming data (streaming), not old data
• Can’t easily use cross-second data
• What if the companion service goes down?
FUTURE
• Use this companion to generate optional small values
• Use Riak Search to index and search those
THE BANDWIDTH PROBLEM
• PUT, bad case: n_val = 3, the coordinating node holds no replica
• inside usage = 3 × outside usage
• PUT, good case: n_val = 3, the coordinating node holds one replica
• inside usage = 2 × outside usage
• GET, bad case: the coordinating node holds no replica
• inside usage = 3 × outside usage
• GET, good case: the coordinating node holds one replica
• inside usage = 2 × outside usage
• Average network usage (PUT and GET), with 3 of the 16 nodes holding any given key:
• 3 × 13/16 + 2 × 3/16 = 2.81
• Plus gossip
• inside network > 3 × outside network
• Usually it’s not a problem
• But in our case:
• big values, constant PUTs, lots of GETs
• sadly, only 1 Gbit/s
• => network bandwidth issue
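The 2.81 figure follows from the replica placement odds on a 16-node ring; a quick back-of-the-envelope check:

```perl
# With n_val = 3 on 16 nodes, a randomly chosen coordinating node
# holds a replica with probability 3/16 (good case: 2 internal
# copies) and holds none with probability 13/16 (bad case: 3).
my ($nodes, $n_val) = (16, 3);
my $ratio = $n_val * ( ($nodes - $n_val) / $nodes )
          + ($n_val - 1) * ( $n_val / $nodes );
printf "%.2f\n", $ratio;  # prints 2.81
```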
THE BANDWIDTH SOLUTIONS
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
• GET, bad case: n_val = 1, the contacted node holds no replica
• inside usage = 1 × outside usage
• GET, good case: n_val = 1, the contacted node holds the data
• inside usage = 0 × outside usage
WARNING
• Possible only because data is read-only
• Data has internal checksum
• No conflict possible
• Corruption detected
RESULT
• Practical network usage halved!
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
• bucket = "metadata", key = "12345"
• The client mirrors Riak's own placement:
Hash = hashFunction(bucket + key)
RingStatus = getRingStatus()
PrimaryNodes = Fun(Hash, RingStatus)
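The idea can be illustrated client-side. This is not Riak's exact algorithm (Riak hashes an Erlang-term encoding of {Bucket, Key} with SHA-1 onto a 2^160 ring); the sketch below only shows the principle of deterministically mapping bucket + key to a vnode, and a simplified vnode-to-node layout:

```perl
use Digest::SHA qw(sha1);

# Deterministically map bucket+key onto one of N vnodes
# (illustrative hash, not Riak's real ring arithmetic).
sub pick_vnode {
    my ($bucket, $key, $num_vnodes) = @_;
    my $hash = unpack 'N', substr sha1("$bucket:$key"), 0, 4;
    return $hash % $num_vnodes;
}

my $vnode = pick_vnode('metadata', '12345', 256);
# With 256 vnodes over 16 nodes, a simplified layout would be:
my $node = $vnode % 16;
```

Because the mapping is deterministic, every client computes the same primary node for a given key and can send its GET straight there.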
WARNING
• Possible only if
• Nodes list is monitored
• In case of failed node, default to random
• Data is requested in a uniform way
RESULT
• Network usage even more reduced !
• Especially for GETs
CONCLUSION
CONCLUSION
• We used only Riak Open Source
• No training, self-taught, small team
• Riak is a great solution
• Robust, fast, scalable, easy
• Very flexible and hackable
• Helps us continue scaling
Q&A
@damsieboy