
Using Riak for Events storage and analysis at Booking.com

At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium- and long-term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.

  1. Event storage and real-time analysis at Booking.com with Riak, by Damien Krotkine
  2. Damien Krotkine • Software Engineer at Booking.com • github.com/dams • @damsieboy • dkrotkine
  3. • 800,000 room nights reserved per day • WE ARE HIRING
  4. INTRODUCTION
  5. [diagram: www, API, mobi]
  6. [diagram: www, API, mobi]
  7. [diagram: www, API, mobi frontends; backend; events storage] • events: info about subsystems status
  8. [diagram: backend event sources: web, mobi, api, databases, caches, load balancers, availability cluster, email, etc…]
  9. WHAT IS AN EVENT?
  10. EVENT STRUCTURE • Provides info about subsystems • Data • Deep HashMap • Timestamp • Type + Subtype • The rest: specific data • Schema-less
  11. EXAMPLE 1: WEB APP EVENT • Large event • Info about user actions • Requests, user type • Timings • Warnings, errors • Etc…
  12. {
        timestamp => 12345,
        type      => 'WEB',
        subtype   => 'app',
        action => {
          is_normal_user => 1,
          pageview_id    => '188a362744c301c2',
          # ...
        },
        tuning => {
          the_request => 'GET /display/...',
          bytes_body  => 35,
          wallclock   => 111,
          nr_warnings => 0,
          # ...
        },
        # ...
      }
  13. EXAMPLE 2: AVAILABILITY CLUSTER EVENT • Small event • Cluster provides availability info • Event: info about request types and timings
  14. {
        type      => 'FAV',
        subtype   => 'fav',
        timestamp => 1401262979,
        dc        => 1,
        tuning => {
          flatav => {
            cluster       => '205',
            sum_latencies => 21,
            role          => 'fav',
            num_queries   => 7
          }
        }
      }
  15. EVENTS FLOW PROPERTIES • Read-only • Schema-less • Continuous, ordered, timed • 15K events per sec • 1.25 billion events per day • peak at 70 MB/s, min 25 MB/s • 100 GB per hour
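A quick sanity check of the flow figures on the slide above (plain Python; 86,400 seconds in a day):

```python
# Cross-check the event-flow figures: 15K events/sec vs. 1.25 billion/day,
# and 100 GB/hour vs. the quoted 25-70 MB/s rates.
events_per_sec = 15_000
events_per_day = events_per_sec * 86_400       # 86,400 seconds in a day

gb_per_hour = 100
avg_mb_per_sec = gb_per_hour * 1024 / 3600     # average write rate in MB/s

print(events_per_day)            # 1296000000, i.e. ~1.3 billion per day
print(round(avg_mb_per_sec, 1))  # 28.4, between the min (25) and peak (70)
```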
  16. SERIALIZATION • JSON didn’t work for us (slow, big, lacks features) • Created Sereal in 2012 • « Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization » • Added a Sereal encoder & decoder in Erlang in 2014
  17. USAGE
  18. ASSESS THE NEEDS • Before thinking about storage • Think about the usage
  19. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS 4. A/B TESTING
  20. GRAPHS • Graph in real-time (few seconds lag) • Graph as many systems as possible • General platform health check
  21. GRAPHS
  22. GRAPHS
  23. DASHBOARDS
  24. META GRAPHS
  25. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS 4. A/B TESTING
  26. DECISION MAKING • Strategic decisions (use facts) • Long term or short term • Technical / non-technical reporting
  27. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS 4. A/B TESTING
  28. SHORT TERM ANALYSIS • From 10 sec ago -> 8 days ago • Code deployment checks and rollback • Anomaly Detector
  29. USAGE 1. GRAPHS 2. DECISION MAKING 3. SHORT TERM ANALYSIS 4. A/B TESTING
  30. A/B TESTING • Our core philosophy: use facts • It means: do A/B testing • Concept of Experiments • Events provide data to compare
  31. EVENT AGGREGATION
  32. EVENT AGGREGATION • Group events • Granularity we need: second
  33. [diagram: individual events flowing into events storage]
  34. [diagram: events collected by a LOGGER]
  35. [diagram: events streaming from web and api]
  36. [diagram: events streaming from web, api and dbs]
  37. [diagram: events from web, api and dbs, grouped per 1 sec]
  38. [diagram: events from web, api and dbs, grouped per 1 sec]
  39. [diagram: 1-sec event groups sent to events storage]
  40. [diagram: 1-sec event groups reserialized + compressed, then sent to events storage]
  41. [diagram: many LOGGERs feeding events storage]
  42. STORAGE
  43. WHAT WE WANT • Storage security • Mass write performance • Mass read performance • Easy administration • Very scalable
  44. WE CHOSE RIAK • Security: cluster, distributed, very robust • Good and predictable read / write performance • The easiest to set up and administer • Advanced features (MapReduce, triggers, 2i, CRDTs…) • Riak Search • Multi Datacenter Replication
  45. CLUSTER • Commodity hardware • All nodes serve data • Data replication • Gossip between nodes • No master • Ring of servers
  46. [diagram: hash(key) mapped onto the ring]
  47. KEY-VALUE STORE • Namespaces: buckets • Values: opaque or CRDTs
  48. RIAK: ADVANCED FEATURES • MapReduce • Secondary indexes • Riak Search • Multi DataCenter Replication
  49. MULTI-BACKEND • Bitcask • Eleveldb • Memory
  50. BACKEND: BITCASK • Log-based storage backend • Append-only files • Advanced expiration • Predictable performance • Perfect for reading sequential data
  51. CLUSTER CONFIGURATION
  52. DISK SPACE NEEDED • 8 days • 100 GB per hour • Replication 3 • 100 * 24 * 8 * 3 • Need ~60 TB
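The slide's sizing arithmetic, spelled out (plain Python):

```python
# 8 days of history at 100 GB/hour, stored 3 times (n_val = 3).
gb_per_hour = 100
hours_per_day = 24
days = 8
n_val = 3
total_gb = gb_per_hour * hours_per_day * days * n_val
print(total_gb)  # 57600 GB, i.e. ~60 TB once rounded up
```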
  53. HARDWARE • 16 nodes • 12 CPU cores (Xeon 2.5 GHz) • 192 GB RAM • network 1 Gbit/s • 8 TB (RAID 6) • Cluster space: 128 TB
  54. RIAK CONFIGURATION • Vnodes: 256 • Replication: n_val = 3 • Expiration: 8 days • 4 GB files • Compaction only when file is full • Compact only once a day
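For reference, settings like these would typically live in Riak's classic app.config; the fragment below is an illustrative sketch matching the slide's numbers (the data_root path and the exact byte values are assumptions, not Booking.com's actual configuration):

```erlang
%% Illustrative app.config fragment (assumed values, not the production file)
{riak_core, [
    {ring_creation_size, 256}            %% 256 vnodes
]},
{riak_kv, [
    {storage_backend, riak_kv_bitcask_backend}
]},
{bitcask, [
    {data_root, "/var/lib/riak/bitcask"},
    {max_file_size, 4294967296},         %% 4 GB files
    {expiry_secs, 691200}                %% 8 days = 691,200 seconds
]}
```

Note that n_val = 3 is a bucket property rather than a node setting, so it is set per bucket, not in this file.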
  55. DISK SPACE RECLAIMED [graph: disk usage dropping once a day as compaction reclaims space]
  56. DATA DESIGN
  57. [diagram: 1-sec event groups from web, api and dbs stored as 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE, in 500 KB max chunks]
  58. DATA • Bucket name: “data“ • Key: “12345:1:cell0:WEB:app:chunk0“ • Value: serialized compressed data • About 120 keys per second
  59. METADATA • Bucket name: “metadata“ • Key: epoch-dc, e.g. “12345-2“ • Value: list of data keys:
        [ “12345:1:cell0:WEB:app:chunk0“,
          “12345:1:cell0:WEB:app:chunk1“,
          …
          “12345:4:cell0:EMK::chunk3“ ]
      • Stored as a pipe-separated value
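The key scheme above is easy to reproduce client-side; a small sketch (plain Python, helper names are hypothetical) building data keys and the matching pipe-separated metadata value:

```python
# Build data keys ("EPOCH:DC:CELL:TYPE:SUBTYPE:chunkN") and the metadata
# entry ("EPOCH-DC" -> pipe-separated list of data keys).

def data_key(epoch, dc, cell, type_, subtype, chunk):
    return f"{epoch}:{dc}:{cell}:{type_}:{subtype}:chunk{chunk}"

def metadata_entry(epoch, dc, data_keys):
    return f"{epoch}-{dc}", "|".join(data_keys)

keys = [data_key(12345, 1, "cell0", "WEB", "app", i) for i in range(3)]
meta_key, meta_value = metadata_entry(12345, 1, keys)
print(meta_key)    # 12345-1
print(meta_value)  # the three data keys joined by "|"
```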
  60. PUSH DATA IN
  61. PUSH DATA IN • In each DC, in each cell, Loggers push to Riak • 2 protocols: REST or ProtoBuf • Every second: • Push data values to Riak, async • Wait for success • Push metadata
  62. JAVA
      Bucket DataBucket = riakClient.fetchBucket("data").execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
      DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();
      Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
      MetaDataBucket.store("12345-1", metaData).execute();
      riakClient.shutdown();
  63. Perl
      my $client = Riak::Client->new(…);
      $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
      $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
      $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
      $client->put(metadata => '12345-1', $metadata, 'text/plain');
  64. GET DATA OUT
  65. GET DATA OUT • Request metadata for epoch-DC • Parse value • Filter out unwanted types / subtypes • Fetch the data keys
  66. Perl
      my $client = Riak::Client->new(…);
      my @array = split /\|/, $client->get(metadata => '12345-1');
      my @filtered_array = grep { /WEB/ } @array;
      $client->get(data => $_) foreach @filtered_array;
  67. REAL TIME PROCESSING OUTSIDE OF RIAK
  68. STREAMING • Fetch 1 second every second • Or a range (last 10 min) • Client generates all the epochs for the range • Fetch all epochs from Riak
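Because one metadata key exists per second and per DC, enumerating a time range needs no index; the client can generate the keys locally, along these lines (plain Python, hypothetical function name):

```python
import time

def metadata_keys_for_range(start_epoch, end_epoch, dc):
    # one metadata key per second: "<epoch>-<dc>"
    return [f"{epoch}-{dc}" for epoch in range(start_epoch, end_epoch + 1)]

now = int(time.time())
keys = metadata_keys_for_range(now - 600, now - 1, dc=1)  # last 10 minutes
print(len(keys))  # 600 keys to fetch, one per second
```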
  69. EXAMPLES • Continuous fetch => Graphite (every sec) • Continuous fetch => Anomaly Detector (last 2 min) • Continuous fetch => Experiment analysis (last day) • Continuous fetch => Hadoop • Manual request => test, debug, investigate • Batch fetch => ad hoc analysis • => Huge numbers of fetches
  70. [diagram: events storage fanning out at ~50 MB/s each to the graphite cluster, anomaly detector, experiment cluster, hadoop cluster, mysql analysis, and manual requests]
  71. REALTIME • 1 second of data • Stored in < 1 sec • Available after < 1 sec • Issue: network saturation
  72. REAL TIME PROCESSING INSIDE RIAK
  73. THE IDEA • Instead of: fetch the data, crunch it, produce a small result • Do: bring the code to the data
  74. WHAT TAKES TIME • Takes a lot of time • Fetching data out • Decompressing • Takes almost no time • Crunching data
  75. MAPREDUCE • Send code to be executed • Works fine for 1 job • Takes < 1s to process 1s of data • Doesn’t work for multiple jobs • Has to be written in Erlang
  76. HOOKS • Every time metadata is written • Post-commit hook triggered • Crunch data on the nodes
  77. [diagram: on each node host, a Riak post-commit hook sends the new keys over a socket to a local REST service, which fetches, decompresses and processes all tasks]
  78. HOOK CODE
      metadata_stored_hook(RiakObject) ->
          Key = riak_object:key(RiakObject),
          _Bucket = riak_object:bucket(RiakObject),
          [ Epoch, _DC ] = binary:split(Key, <<"-">>),
          Data = riak_object:get_value(RiakObject),
          DataKeys = binary:split(Data, <<"|">>, [ global ]),
          {ok, Host} = inet:gethostname(),
          Hostname = list_to_binary(Host),
          send_to_REST(Epoch, Hostname, DataKeys),
          ok.
  79. send_to_REST(Epoch, Hostname, DataKeys) ->
          Method = post,
          URL = "http://" ++ binary_to_list(Hostname)
              ++ ":5000?epoch=" ++ binary_to_list(Epoch),
          HTTPOptions = [ { timeout, 4000 } ],
          Options = [ { body_format, string },
                      { sync, false },
                      { receiver, fun(_ReplyInfo) -> ok end } ],
          Body = iolist_to_binary(mochijson2:encode(DataKeys)),
          httpc:request(Method, {URL, [], "application/json", Body},
                        HTTPOptions, Options),
          ok.
  80. REST SERVICE • In Perl, using PSGI, Starman, preforks • Allows writing data crunchers in Perl • Also supports loading code on demand
  81. ADVANTAGES • CPU usage and execution time can be capped • Data is local to processing • Two systems are decoupled • REST service written in any language • Data processing done all at once • Data is decompressed only once
  82. DISADVANTAGES • Only for incoming data (streaming), not old data • Can’t easily use cross-second data • What if the companion service goes down?
  83. FUTURE • Use this companion to generate optional small values • Use Riak Search to index and search those
  84. THE BANDWIDTH PROBLEM
  85. [diagram] • PUT - bad case • n_val = 3 • inside usage = 3 x outside usage
  86. [diagram] • PUT - good case • n_val = 3 • inside usage = 2 x outside usage
  87. [diagram] • GET - bad case • inside usage = 3 x outside usage
  88. [diagram] • GET - good case • inside usage = 2 x outside usage
  89. • network usage (PUT and GET): • 3 x 13/16 + 2 x 3/16 = 2.81 • plus gossip • inside network > 3 x outside network
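The 2.81 figure on the slide checks out: with 16 nodes and n_val = 3, a randomly chosen coordinator holds one of the replicas in 3 cases out of 16 (2 internal copies needed) and none in the other 13 (3 internal copies):

```python
# Expected internal transfers per external request, 16 nodes, n_val = 3,
# coordinator chosen at random (the slide's 3 x 13/16 + 2 x 3/16 formula).
nodes = 16
n_val = 3
bad = n_val * (nodes - n_val) / nodes    # coordinator holds no replica
good = (n_val - 1) * n_val / nodes       # coordinator holds one replica
print(round(bad + good, 2))  # 2.81
```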
  90. • Usually it’s not a problem • But in our case: • big values, constant PUTs, lots of GETs • sadly, only 1 Gbit/s • => network bandwidth issue
  91. THE BANDWIDTH SOLUTIONS
  92. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don’t choose a node at random
  93. [diagram] • GET - bad case • n_val = 1 • inside usage = 1 x outside
  94. [diagram] • GET - good case • n_val = 1 • inside usage = 0 x outside
  95. WARNING • Possible only because data is read-only • Data has internal checksum • No conflict possible • Corruption detected
  96. RESULT • practical network usage reduced by half!
  97. THE BANDWIDTH SOLUTIONS 1. Optimize GET for network usage, not speed 2. Don’t choose a node at random
  98. • bucket = “metadata” • key = “12345”
  99. • bucket = “metadata” • key = “12345”
      Hash = hashFunction(bucket + key)
      RingStatus = getRingStatus
      PrimaryNodes = Fun(Hash, RingStatus)
  100. [diagram: hashFunction() and getRingStatus()]
  101. [diagram: hashFunction() and getRingStatus() combined to pick the primary nodes]
  102. WARNING • Possible only if • Nodes list is monitored • In case of a failed node, default to random • Data is requested in a uniform way
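The node-selection trick on slides 99-102, computing the ring position client-side and querying a primary node instead of a random one, can be sketched with a simplified ring (plain Python; the real Riak ring uses SHA-1 over a 2^160 keyspace split into vnodes, so this only shows the shape of the technique, and all names are hypothetical):

```python
import hashlib

def preferred_node(bucket: bytes, key: bytes, nodes: list, n_val: int = 3):
    """Deterministically pick the node to query for (bucket, key).

    Simplified stand-in for Riak's consistent hashing: hash bucket+key,
    map the digest to a ring position, take the n_val nodes clockwise
    from there, and query the first primary. Returns None on an empty
    ring view (a real client would fall back to a random node, as the
    slide's warning requires).
    """
    if not nodes:
        return None
    position = int.from_bytes(hashlib.sha1(bucket + key).digest(), "big")
    partition = position % len(nodes)
    primaries = [nodes[(partition + i) % len(nodes)] for i in range(n_val)]
    return primaries[0]

ring = [f"riak{i:02d}" for i in range(16)]
# the same key always maps to the same primary, so its GETs hit a replica owner
assert preferred_node(b"metadata", b"12345-1", ring) == \
       preferred_node(b"metadata", b"12345-1", ring)
```

This only helps while the client's view of the ring matches the cluster's, which is why the slide insists the node list must be monitored.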
  103. RESULT • Network usage reduced even further! • Especially for GETs
  104. CONCLUSION
  105. CONCLUSION • We used only Riak Open Source • No training, self-taught, small team • Riak is a great solution • Robust, fast, scalable, easy • Very flexible and hackable • Helps us continue scaling
  106. Q&A @damsieboy
