Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013

In the world of real-time bidding (RTB), it is crucial to get performance metrics as soon as possible. This is why AdGear built its own real-time analytics system.

In this talk, Louis-Philippe will share what he has learned building this system and introduce Swirl, AdGear's lightweight distributed stream processor. He will also give some pointers on how to implement a subset of SQL to power your distributed jobs.

Talk objectives:
- Introduce Swirl, a lightweight distributed stream processor
- Implement a subset of SQL (lexer + parser + boolean logic)
- Demo a real-time graphing web interface powered by Swirl, Cowboy, Bullet and D3.js

Published in: Technology

  1. Building “sexy” real-time analytics systems
  2. AdGear is a full-stack ad platform for publishers and advertisers, with advanced analytics, attribution measurement, ad serving, and real-time bidding technology.
  3. Real-time bidding (RTB)
  4. Real-time reporting... why?
     • help clients to make informed decisions
       • should I increase the bid price?
       • should I bid on exchange X?
     • inventory control (brand safety)
     • debugging (bot detection, creative audits)
  5. “Sexy” real-time analytics systems
  6. “Sexy”?
     • elegant backend
     • beautiful user interface
  7. Architecture #1 (3 years ago)
     • ssh
     • node.js
     • socket.io
  8. Problems
     • no SMP support
       • each process needs to be monitored
       • requires load-balancing (nginx)
     • duplicated state (per process)
     • duplicated work (de-serialization)
     • bad error handling (event loop explodes)
     • callbacks...
  9. * promise construct
  10. Architecture #2 (1.5 years ago)
     • ssh_channel *
     • gproc (pub sub)
     • ETS counters
     • bullet (cowboy)
     * https://gist.github.com/lpgauth/6529807
  11. Architecture #2
     1. receive buffered events, split and de-serialize
     2. each event is sent to a collector process (3) using gproc (pubsub) for filtering
     3. collector (gen_server) aggregates messages using ETS counters and flushes every second
     4. bullet handler serializes the aggregates (tab2list to json)
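
The four steps of Architecture #2 can be sketched roughly as follows. This is an illustration in Python, not the talk's Erlang code: the newline-delimited JSON wire format, the `Collector` class, the `dispatch` function, and the `event` / `buyer_id` fields are all assumptions made for the example, and the one-second timer is left out (the caller invokes `flush` explicitly).

```python
import json
from collections import Counter


class Collector:
    """Per-client collector: aggregates events and flushes them as JSON (steps 3 and 4)."""

    def __init__(self, predicate):
        self.predicate = predicate   # hypothetical stand-in for the pub/sub filter
        self.counters = Counter()    # stand-in for the ETS counters

    def handle_event(self, event):
        # One counter per (event type, buyer) pair; the key choice is illustrative.
        self.counters[(event["event"], event["buyer_id"])] += 1

    def flush(self):
        # Serialize the current aggregates and reset them for the next interval.
        rows = [{"event": e, "buyer_id": b, "count": n}
                for (e, b), n in self.counters.items()]
        self.counters.clear()
        return json.dumps(rows)


def dispatch(raw_buffer, collectors):
    """Steps 1 and 2: split the buffer, de-serialize, publish to matching collectors."""
    for line in raw_buffer.splitlines():
        if not line:
            continue
        event = json.loads(line)
        for collector in collectors:
            if collector.predicate(event):
                collector.handle_event(event)
```

The inner loop of `dispatch` delivers each event once per interested collector, so the message count grows with the number of connected clients.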
  12. Problems
     • ssh_channel process and collector process are bottlenecks
     • number of messages increases with the number of clients
     • requires lots of bandwidth for large streams
     • limited filtering (match specs)
  13. Improvements... (6 months ago)
     • optimize collector’s msg loop (gen_server to proc_lib)
     • use ssh compression
       • added support for openssh zlib compression *
       • R16B02
     * https://github.com/lpgauth/otp/tree/openssh_zlib
  14. This worked for a while...
  15. “Hey man, it would be very cool if you could show in real-time the number of bid requests per domain for Friday’s demo... Can you do it?” - boss
  16. Sure.
  17. What did I just agree to...
     • I only have 3 days to build this...
     • bid requests stream is too large to aggregate in a central location (1+ Gbit/s - 80K+/s)
  18. Strategy for demo
     1. move aggregation upstream
     2. use ETS match select to find table ids (filtering)
     3. increment counters in process (no message!)
     4. periodically flush aggregates via message to collector node
     5. collector node increments local counters and periodically flushes aggregates to bullet handler
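
A minimal sketch of the "move aggregation upstream" idea, again in Python for illustration only. `UpstreamAggregator`, `CollectorNode`, and the `send` callback are hypothetical names; the ETS-based filtering of step 2 is omitted, and flushing is triggered by the caller rather than by a timer.

```python
from collections import Counter


class UpstreamAggregator:
    """Runs next to the event source: counts locally, ships only aggregates."""

    def __init__(self, send):
        self.send = send            # hypothetical transport to the collector node
        self.counters = Counter()

    def emit(self, domain):
        # Step 3: a plain in-process increment, no message per event.
        self.counters[domain] += 1

    def flush(self):
        # Step 4: at most one message per flush, whatever the event volume.
        if self.counters:
            self.send(dict(self.counters))
            self.counters.clear()


class CollectorNode:
    """Step 5: merges partial counts from every upstream node."""

    def __init__(self):
        self.totals = Counter()

    def receive(self, aggregates):
        self.totals.update(aggregates)   # adds counts key by key

    def flush(self):
        # Hand the merged totals to the web handler and start a new interval.
        snapshot = dict(self.totals)
        self.totals.clear()
        return snapshot
```

With this layout the traffic reaching the collector depends on the number of distinct keys per flush interval, not on the raw event rate.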
  19. Success!
  20. Introducing swirl! “lightweight distributed stream processor”
  21. Swirl components
     • “dynamic” streams (swirl_stream)
     • powerful filtering language (swirl_ql)
     • simple behavior that implements a map-reduce like interface (swirl_flow)
     • process registry (swirl_tracker)
  22. Streams
  23. Flows
     * application:start(swirl).
  24. swirl_flow behavior
  25. Mapper Node
     1. process “emits” event
     2. lookup in ETS if there’s a flow that matches the stream name and filter
     3. if there’s a match, call flow_mod:map/4
     4. if map returns counters, increment in ETS
     5. swirl_mapper periodically flushes aggregates to reducer node
  26. Reducer Node
     1. swirl_tracker receives mapper aggregates and forwards them to reducer
     2. reducer increments counters in ETS
     3. reducer flushes counters to flow_mod:reduce/4
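
The mapper and reducer steps above might look something like this when reduced to their essentials. It is a Python illustration under stated assumptions, not Swirl's API: the slides only name `flow_mod:map/4` and `flow_mod:reduce/4` without listing their arguments, so the `map(stream_name, event)` and `reduce(period, aggregates)` signatures, the `period` value, and the `Mapper` / `Reducer` classes are all invented for the sketch.

```python
from collections import Counter


class DomainCountFlow:
    """Hypothetical flow module: counts bid requests per domain."""

    def map(self, stream_name, event):
        # Return counter increments, or None to ignore the event.
        domain = event.get("domain")
        return {(stream_name, domain): 1} if domain else None

    def reduce(self, period, aggregates):
        # A real flow would push this to a UI; here it is just remembered.
        self.last = (period, aggregates)


class Mapper:
    def __init__(self):
        self.flows = []       # stand-in for the ETS table of registered flows
        self.counters = {}    # flow -> Counter of local aggregates

    def register(self, stream_name, predicate, flow):
        self.flows.append((stream_name, predicate, flow))
        self.counters[flow] = Counter()

    def emit(self, stream_name, event):
        # Mapper steps 1 to 4: match on stream name and filter, map, increment.
        for name, predicate, flow in self.flows:
            if name == stream_name and predicate(event):
                increments = flow.map(stream_name, event)
                if increments:
                    self.counters[flow].update(increments)

    def flush(self, reducer):
        # Mapper step 5: ship aggregates, not individual events.
        for flow, counts in self.counters.items():
            if counts:
                reducer.receive(flow, dict(counts))
                counts.clear()


class Reducer:
    def __init__(self):
        self.counters = {}

    def receive(self, flow, aggregates):
        # Reducer steps 1 and 2: merge aggregates from every mapper.
        self.counters.setdefault(flow, Counter()).update(aggregates)

    def flush(self, period):
        # Reducer step 3: hand the merged counters to the flow's reduce callback.
        for flow, counts in self.counters.items():
            flow.reduce(period, dict(counts))
            counts.clear()
```

The predicate passed to `register` plays the role of the flow's filter; in Swirl that filter is written in swirl_ql rather than as a host-language function.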
  27. Swirl-ql
     • sql where clause like syntax
     • supported operators:
       • AND / OR
       • <, <=, =, >, <>
       • IN (x, y) / NOT IN (x, y, z)
       • IS NULL / IS NOT NULL (undefined)
     * https://github.com/lpgauth/swirl-ql
  28. Swirl-ql
     • examples:
       • “event IN (‘impression’, ‘click’)”
       • “buyer_id IS NOT NULL AND buyer_id <> 3”
       • “event = ‘impressions’ AND (buyer_id IN (3, 5) OR buyer_id IS NULL)”
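
To make the "lexer + parser + boolean logic" objective concrete, here is a small hand-written sketch that handles the operators listed above. It is not swirl_ql: the real project uses leex / yecc in Erlang, whereas this is a recursive-descent parser in Python, and every function name in it is made up for the example. It also assumes that a missing field makes any comparison or IN / NOT IN test false, which the slides do not specify.

```python
import operator
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|'([^']*)'|(<=|<>|[<=>(),])|(\w+))")
KEYWORDS = {"AND", "OR", "IN", "NOT", "IS", "NULL"}
COMPARE = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
           ">": operator.gt, "<>": operator.ne}

def tokenize(text):
    """Lexer: turn a clause into (kind, value) pairs."""
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        number, string, symbol, word = match.groups()
        if number is not None:
            tokens.append(("value", int(number)))
        elif string is not None:
            tokens.append(("value", string))
        elif symbol is not None:
            tokens.append(("op" if symbol in COMPARE else symbol, symbol))
        else:
            is_keyword = word.upper() in KEYWORDS
            tokens.append((word.upper() if is_keyword else "field", word))
        pos = match.end()
    return tokens

def parse(tokens):
    """Parser: recursive descent; AND binds tighter than OR."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, got {peek()}")
        pos += 1
        return tokens[pos - 1][1]

    def skip(kind):  # consume an optional token, report whether it was there
        return peek() == kind and take(kind) is not None

    def or_expr():
        node = and_expr()
        while skip("OR"):
            node = ("or", node, and_expr())
        return node

    def and_expr():
        node = predicate()
        while skip("AND"):
            node = ("and", node, predicate())
        return node

    def predicate():
        if skip("("):
            node = or_expr()
            take(")")
            return node
        field = take("field")
        if peek() == "op":
            return ("cmp", take("op"), field, take("value"))
        if skip("IS"):
            negated = skip("NOT")
            take("NULL")
            return ("notnull" if negated else "null", field)
        negated = skip("NOT")
        take("IN")
        take("(")
        values = [take("value")]
        while skip(","):
            values.append(take("value"))
        take(")")
        return ("notin" if negated else "in", field, values)

    tree = or_expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

def evaluate(node, event):
    """Boolean logic: check one event (a dict) against the parsed tree."""
    kind = node[0]
    if kind == "and":
        return evaluate(node[1], event) and evaluate(node[2], event)
    if kind == "or":
        return evaluate(node[1], event) or evaluate(node[2], event)
    if kind in ("null", "notnull"):
        return (event.get(node[1]) is None) == (kind == "null")
    actual = event.get(node[-2])  # the field sits second-to-last in these nodes
    if kind == "cmp":  # assumption: a missing field fails every comparison
        return actual is not None and COMPARE[node[1]](actual, node[3])
    return actual is not None and (actual in node[2]) == (kind == "in")
```

A clause is parsed once and the resulting tree is reused for every event, for example `evaluate(parse(tokenize("buyer_id IS NOT NULL AND buyer_id <> 3")), {"buyer_id": 5})`. The sketch expects plain ASCII quotes around string literals and rejects anything outside the operator list on the slide, including `>=`.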
  29. Swirl-ql
     • leex / yecc for parsing (use lex / yacc doc)
     • pattern match ftw!
     • use hipe (~200% speed gain in micro benchmarks)
       • 0.286 vs 0.097 microseconds *
     • experimenting with dynamic compilation
     * http://theory.stanford.edu/~sergei/papers/sigmod10-index.pdf
  30. Swirl limitations
     • best-effort (hard problem!)
       • netsplits
       • crash
     • in-memory only
  31. Todo
     • node discovery
     • code distribution
     • resource limitation
     • better documentation!
  32. Architecture #3 (now!)
     • swirl
     • bullet (cowboy)
  33. Demo!
     * https://github.com/lpgauth/swirl-demo
  34. Thank You!
     pssst: we’re hiring!
     twitter: lpgauth
     github: lpgauth