Papers we love realtime at facebook

Presentation for Papers We Love at QCON NYC 17. I didn't write the paper, good people at Facebook did. But I sure enjoyed reading it and presenting it.

Published in: Data & Analytics

  1. Papers We Love: Realtime Data Processing at Facebook. Gwen Shapira, Confluent Inc.
  2. Papers We Love: Realtime Data Processing at Facebook
  3. Published in 2016 (!)
  4. What kind of paper is this?
  5. This is NOT the one true architecture. Please don’t cargo-cult this paper.
  6. A few real-time systems at Facebook
       • Chorus – aggregate trends
       • Realtime feedback for mobile app developers
       • Page analytics – likes, engagement…
       • Offload CPU-intensive dashboard queries
  10. Looking for trending topics in 5-minute windows
  11. The Tofu & Potatoes of the paper: Design Decisions
  12. / KafkaStreams + exactly once
  13. Decision #1 – Language Paradigm
       • Declarative (SQL) – easy & limited
       • Functional
       • Procedural (C++, Java, Python) – most flexibility, control, performance. Longer dev cycle.
  15. Decision #2: Data Transfer
       • RPC (Millwheel, Flink, SparkStreaming)
           • All about speed
       • Message-forwarding broker (Heron)
           • Applies back-pressure, multiplex
       • Persistent stream storage (Samza, Kafka’s Stream API)
           • Most reliable
           • Decouples processors
  16. Decision #2: Data Transfer
  17. Love Song to Scribe: Independent stream processing nodes, and storing inputs / outputs, made everything great
  18. Decision #3 – Processing Semantics
  19. Decision #3 – Processing Semantics. Facebook Verdict: It depends on requirements
       • Ranker writes to idempotent system – at least once
       • Scuba can lose data, but not handle duplicates – at most once
       • …
      Exactly once is REALLY HARD and requires transactions
  20. Don’t miss the side-note on side-effects
       • Exactly once means writing output + offsets to a transactional system
       • This takes time
       • Why just wait when you can deserialize? And maybe do other stateless stuff?
  21. Decision #4 – State Saving
       • In-memory state with replication (Old VoltDB)
           • Requires lots of hardware and network
       • Local database (Samza, Kafka Streams API)
       • Remote database (Millwheel)
       • Upstream (i.e. replay everything on failure)
       • Global consistent snapshot (Flink)
  22. Decision #4 – State Saving. Facebook Verdict: It depends. [Image labels: Rhode Island, Alaska]
  23. Best Part of the Paper – by far: How to efficiently work with state in a remote DB?
  25. Decision #5 – Reprocessing
       • Stream only – requires long retention in the stream store
       • Maintain both batch and stream systems
       • Develop systems that can run in streams and batch (Flink, Spark)
      Facebook Verdict: SQL runs everywhere, and binary generation FTW
  26. Applications – Or a whirlwind tour of good patterns. One example:
  27. Lessons Learned! The biggest win is pipelines composed of independent processors
       • Mixing multiple systems let us move fast
       • High level abstractions let us improve implementation
       • Ease of debugging – Independent nodes and ability to replay
       • Ease of deployment – Puma as-a-service
       • Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.
       • In the future – auto-scale based on lag
  28. Thank You!
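
A sketch to go with the Decision #3 (Processing Semantics) slides: the practical difference between at-most-once and at-least-once comes down to whether a processor saves its offset before or after it writes output. The toy simulation below is not Facebook's code and uses no real streaming API; `run`, `crash_at`, and `commit_first` are hypothetical names made up for illustration, and the "crash" is simply a restart from the last saved offset.

```python
from typing import List


def run(events: List[str], crash_at: int, commit_first: bool) -> List[str]:
    """Process events with one simulated crash and restart.

    commit_first=True  -> save the offset, then emit (at-most-once).
    commit_first=False -> emit, then save the offset (at-least-once).
    The crash hits between those two steps while handling events[crash_at].
    """
    output: List[str] = []  # stands in for the downstream system
    committed = 0           # durable checkpoint: next offset to read
    crashed = False         # we only crash once
    offset = committed
    while offset < len(events):
        event = events[offset]
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed  # restart from checkpoint: event is lost
                continue
            output.append(event)
        else:
            output.append(event)
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed  # restart from checkpoint: event is redone
                continue
            committed = offset + 1
        offset += 1
    return output


events = ["a", "b", "c"]
at_most_once = run(events, crash_at=1, commit_first=True)    # drops "b"
at_least_once = run(events, crash_at=1, commit_first=False)  # repeats "b"

# An idempotent sink (the Ranker case on the slide) absorbs the duplicate.
# Deduplicating by value is only a stand-in for keyed, idempotent writes.
idempotent_view = list(dict.fromkeys(at_least_once))
```

With the crash placed on the second event, the commit-first run loses that event and the emit-first run writes it twice, which is why the slide pairs at-least-once with an idempotent system and at-most-once with a system that tolerates loss. Exactly-once would need the output and the offset saved in one transaction, which this sketch deliberately does not attempt.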
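
A second sketch, for the "trending topics in 5 minute windows" slide: the simplest reading of that phrase is a tumbling window, where each event is assigned to the 5-minute bucket its timestamp falls into and topics are counted per bucket. This is an in-memory illustration only, not how Puma or any Facebook system does it; the `(timestamp, topic)` event shape and the `trending` function are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 5 * 60  # tumbling window size taken from the slide


def trending(
    events: Iterable[Tuple[int, str]], top_n: int = 2
) -> Dict[int, List[Tuple[str, int]]]:
    """Map each window start (seconds) to its top_n (topic, count) pairs.

    events are hypothetical (unix_timestamp_seconds, topic) tuples.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for ts, topic in events:
        # Round the timestamp down to the start of its 5-minute window.
        window_start = ts - ts % WINDOW_SECONDS
        counts[window_start][topic] += 1
    return {start: c.most_common(top_n) for start, c in sorted(counts.items())}


sample = [
    (0, "qcon"), (10, "kafka"), (200, "qcon"), (299, "qcon"),
    (300, "kafka"), (310, "kafka"), (590, "scribe"),
]
top_per_window = trending(sample, top_n=1)
```

For the sample events, the first window is led by "qcon" and the second by "kafka". A real streaming job would also have to decide when a window is complete and what to do with late events, which is where the deck's later decisions on state and processing semantics come in.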