A Practical Guide to Selecting a Stream Processing Technology

Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. In particular, we will look at how Kafka Streams, the stream processing engine built into Apache Kafka, compares to stream processing systems that require a separate processing infrastructure.



  1. 1. A Practical Guide to Selecting a Stream Processing Technology • Michael G. Noll, Product Manager, Confluent
  2. 2. Kafka Talk Series • Sep 27: Introduction to Streaming Data and Stream Processing with Apache Kafka • Oct 06: Deep Dive into Apache Kafka • Oct 27: Data Integration with Apache Kafka • Nov 17: Demystifying Stream Processing with Apache Kafka • Dec 01: A Practical Guide to Selecting a Stream Processing Technology • Dec 15: Streaming in Practice: Putting Apache Kafka in Production • https://www.confluent.io/apache-kafka-talk-series
  3. 3. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  4. 4. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  5. 5. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  6. 6. Powered by Kafka (thousands more)
  7. 7. Spark Streaming API (2.0)
  8. 8. Kafka’s Streams API (0.10)
  9. 9. Example: Streams and Tables in Kafka • Word Count: hello 2, kafka 1, world 1, …
  10. 10. Streams & Databases • A stream processing technology must have first-class support for Streams and Tables • With scalability, fault tolerance, … • Why? Because most use cases require not just one, but both! • Support, or lack thereof, strongly impacts the resulting technical architecture and development effort • No support means: painful Do-It-Yourself; increased complexity, more moving pieces to juggle
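The stream/table duality on slides 9 and 10 can be sketched in plain Java. This is an illustration only, not the Kafka Streams API; the class and method names are made up. Each incoming word event updates a table of latest counts, so the table is exactly the stream's aggregation (and the table's change history would itself be a stream):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration of the stream/table duality: a stream of word events is
// folded into a "table" of latest counts per key (roughly what
// KStream#groupBy + count() -> KTable does in Kafka Streams).
public class StreamTableDuality {
    // Each record upserts the latest state for its key.
    public static Map<String, Long> countWords(List<String> wordEvents) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (String word : wordEvents) {
            table.merge(word, 1L, Long::sum); // update latest count
        }
        return table;
    }
}
```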
  11. 11. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  12. 12. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  13. 13. Organizational/Non-Tech Dimensions • Can your org understand and leverage the technology? Familiarity with languages; intuitive concepts and APIs; training • Are you permitted to use it in your organization? Security features, licensing, open source vs. proprietary • Can you continue to use it in the future? Longevity of the technology, licensing, vendor strength
  14. 14. Organizational/Non-Tech Dimensions • Do you believe in the long-term vision? Switching technologies in an organization is often expensive and slow: legacy migration, re-training, resistance to change, etc. • What is the path and time to success? Can you move smoothly and quickly from proof-of-concept to production? • Areas and range of applicability in your organization: general-purpose vs. niche technology; viable for S/M/L/XL use cases vs. XL use cases only; building core business apps vs. doing backend analytics
  15. 15. Organizational/Non-Tech Dimensions: Licensing • Vision/Roadmap • ROI • Impact on Organization • Broad vs. Niche Applicability • Time to Market • Professional Services • Documentation • Examples • User Community • Learning Curve • Impact on Tools, Infrastructure, …
  16. 16. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  17. 17. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  18. 18. State • Stateful processing of any kind requires…state • Many (most?) use cases for stream processing are stateful: joins, aggregations, windowing, counting, ... • Is state performant? Local vs. remote state?
  19. 19. State • Stateful processing of any kind requires…state • Many (most?) use cases for stream processing are stateful: joins, aggregations, windowing, counting, ... • Is state performant? Local vs. remote state? • Is state fault-tolerant? How fast is recovery/failover?
  20. 20. State • Stateful processing of any kind requires…state • Many (most?) use cases for stream processing are stateful: joins, aggregations, windowing, counting, ... • Is state performant? Local vs. remote state? • Is state fault-tolerant? How fast is recovery/failover? • Is state interactively queryable? Kafka: ready for use (GA); Spark, Flink: under development (alpha); Storm, Samza, and others: not available
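The fault-tolerance and queryability questions above can be made concrete with a small plain-Java sketch (an illustration of the idea, not any real API; all names are made up): every update to a local store is appended to a durable changelog, so a replacement instance recovers by replaying it, and "interactive queries" read the latest state straight from the local store.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration: a local state store backed by a changelog. Failover means
// constructing a new instance over the same changelog and replaying it.
public class ChangeloggedStore {
    private final Map<String, Long> local = new HashMap<>();
    private final List<String[]> changelog; // durable and shared in a real system

    public ChangeloggedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) { // recovery = replay the changelog
            local.put(entry[0], Long.parseLong(entry[1]));
        }
    }

    public void put(String key, long value) {
        local.put(key, value);
        changelog.add(new String[] {key, Long.toString(value)});
    }

    // "Interactive query": serve the latest state from the local store.
    public Long get(String key) {
        return local.get(key);
    }
}
```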
  21. 21. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  22. 22. Abstractions • What are the data model and the available abstractions? • Most common abstraction: stream of records/events (Kafka, Spark, Storm, Samza, Flink, Apex, ...) • New, very powerful: table of records, currently unique to Kafka; represents latest state and materialized views • State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!
  23. 23. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  24. 24. Time model • Different use cases require different time semantics • The great majority of use cases require event-time semantics • Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time • A stream processing technology should, at a minimum, support event-time to cover most use cases in practice • Examples: Kafka, Beam, Flink
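The three time semantics named on this slide can be sketched in plain Java (illustration only; the names are made up, not a real API). A record carries its event time and ingestion time, while processing time is simply "now" at whatever operator touches it; event-time semantics means computations key off the record's own timestamp:

```java
import java.time.Instant;

// Illustration of time semantics attached to a single record.
public class Timestamps {
    public final Instant eventTime;     // when the event happened at the source
    public final Instant ingestionTime; // when the record entered the system
    // processing time is simply Instant.now() at each operator, so it is
    // not stored on the record at all

    public Timestamps(Instant eventTime, Instant ingestionTime) {
        this.eventTime = eventTime;
        this.ingestionTime = ingestionTime;
    }

    // Event-time semantics: window assignment uses eventTime, regardless
    // of when the record is actually processed.
    public long eventTimeWindowStart(long windowSizeMs) {
        long t = eventTime.toEpochMilli();
        return t - (t % windowSizeMs);
    }
}
```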
  25. 25. Time Model
  26. 26. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  27. 27. Windowing • Windowing is an operation that groups events
  28. 28. Windowing [Figure: input events from users alice, bob, and dave, where colors represent different users' events; rectangles denote different event-time windows on a processing-time vs. event-time plot]
  29. 29. Windowing • Windowing is an operation that groups events • Most commonly needed: time windows, session windows • Examples: real-time monitoring: 5-minute averages; reader behavior on a website: user browsing sessions
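The 5-minute-average example above relies on tumbling time windows. A minimal plain-Java sketch of that grouping (illustration only, made-up names; in Kafka Streams the equivalent is a windowed aggregation over a grouped stream):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration: assign timestamped events to 5-minute tumbling windows
// and count events per window.
public class TumblingWindows {
    public static final long SIZE_MS = 5 * 60 * 1000;

    // Returns a map from window start (epoch ms) to event count.
    public static Map<Long, Integer> countPerWindow(long[] eventTimesMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            long windowStart = t - (t % SIZE_MS); // window this event falls in
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

Session windows would instead close a window whenever the gap between a user's consecutive events exceeds an inactivity threshold.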
  30. 30. Windowing
  31. 31. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  32. 32. Out-of-order and late-arriving data • Very common in practice, not a rare corner case • Related to the time model discussion
  33. 33. Out-of-order and late-arriving data • Users with mobile phones board an airplane and lose Internet connectivity • Emails are written during the 10-hour flight • When Internet connectivity is restored, the phones send the queued emails
  34. 34. Out-of-order and late-arriving data • Very common in practice, not a rare corner case • Related to the time model discussion • We want control over how out-of-order data is handled • Example: we process data in 5-minute windows, e.g. to compute statistics; when an event arrives 1 minute late: update the original result! When an event arrives 2 hours late: discard it! • Handling must be efficient because it happens so often
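The update-or-discard policy from the example can be sketched in plain Java (illustration only, made-up names): track the highest event time seen so far as "stream time"; events that lag behind it by no more than an allowed lateness still update their window's result, anything later is dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration: handle late-arriving events against windowed counts.
public class LateDataHandler {
    private final long windowSizeMs;
    private final long allowedLatenessMs;
    private long streamTimeMs = 0; // highest event time observed so far
    private final Map<Long, Integer> windowCounts = new HashMap<>();

    public LateDataHandler(long windowSizeMs, long allowedLatenessMs) {
        this.windowSizeMs = windowSizeMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    // Returns true if the event was applied, false if discarded as too late.
    public boolean onEvent(long eventTimeMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimeMs);
        if (streamTimeMs - eventTimeMs > allowedLatenessMs) {
            return false; // e.g. 2 hours late: discard
        }
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        windowCounts.merge(windowStart, 1, Integer::sum); // update the result
        return true;
    }

    public int count(long windowStartMs) {
        return windowCounts.getOrDefault(windowStartMs, 0);
    }
}
```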
  35. 35. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  36. 36. Reprocessing • Re-process data by rewinding a stream back in time • Use cases in practice include: correcting output data after fixing a bug; facilitating iterative and explorative development; A/B testing; processing historical data; walking through "What If?" scenarios • Also: often used behind the scenes for fault tolerance
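The bug-fix use case boils down to "rewind and replay": because the durable log keeps every input record, corrected logic can recompute every output from the beginning. A plain-Java sketch (illustration only, made-up names; with a real Kafka consumer the rewind step would be a seek back to an earlier offset):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustration: reprocessing as replaying the retained input log
// through fixed processing logic.
public class Reprocessor {
    public static List<String> replayFromStart(List<String> log,
                                               Function<String, String> fixedLogic) {
        List<String> outputs = new ArrayList<>();
        for (String record : log) { // start again from the very first record
            outputs.add(fixedLogic.apply(record));
        }
        return outputs;
    }
}
```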
  37. 37. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  38. 38. Scalability, Elasticity, Fault Tolerance • Can the technology scale according to your needs? Desired latency and throughput? Able to process millions of messages per second? What is the minimum footprint? • Can you expand/shrink capacity dynamically during operations? Helps with resource utilization, because most stream apps run continuously • Resilience and fault tolerance: which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc. Failover behavior and recovery time? Automated or manual? Any negative impact of fault tolerance features on performance?
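One of the delivery guarantees named above, "effectively-once", is commonly built from at-least-once delivery plus idempotent processing: redelivered records are recognized by id and skipped, so retries never double-count. A plain-Java sketch of that pattern (illustration only, made-up names):

```java
import java.util.HashSet;
import java.util.Set;

// Illustration: idempotent processing on top of at-least-once delivery.
public class EffectivelyOnceCounter {
    private final Set<String> seenIds = new HashSet<>();
    private long total = 0;

    // At-least-once delivery may hand us the same record id more than once.
    public void process(String recordId, long amount) {
        if (!seenIds.add(recordId)) {
            return; // duplicate delivery: ignore
        }
        total += amount;
    }

    public long total() {
        return total;
    }
}
```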
  39. 39. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  40. 40. Security • To meet internal security policies, legal compliance, etc. • Typical base requirements for stream processing applications: encrypt data-in-transit (e.g. from/to Kafka); authentication: "only some applications may talk to production"; authorization: "access to sensitive data such as PII is restricted" • The easier security features are to use, the more likely they are to actually be used in practice
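For Kafka clients, the first two requirements translate into a handful of client configuration properties; a sketch (the property keys are standard Kafka client settings, while the paths and password values here are placeholders):

```java
import java.util.Properties;

// Sketch of Kafka client security settings: TLS encryption in transit plus
// SASL authentication. Authorization (who may read/write which topics) is
// enforced broker-side via ACLs bound to the authenticated principal.
public class SecurityConfig {
    public static Properties clientSecurityProps() {
        Properties props = new Properties();
        // Encrypt data-in-transit and authenticate over TLS.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        // Trust store used to verify the brokers' TLS certificates
        // (placeholder path and password).
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}
```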
  41. 41. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  42. 42. Processing Model • True stream processing is record-at-a-time processing • Benefits include low latency (milliseconds) and dealing efficiently with out-of-order data • Can provide both low latency and high throughput via internal optimizations • Examples: Kafka, Storm, Samza, Flink, Beam • Some processing technologies opt for (micro)batching • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool • Suffers significant overhead when dealing with e.g. out-of-order/late-arriving data and when performing windowed analyses (e.g. session windows) • Typically a strong blocker for use cases such as fraud detection, or anything where "a few seconds" of latency is prohibitive • Examples: Spark, Storm (Trident), Hadoop*
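The latency difference between the two models can be sketched in plain Java (illustration only, made-up names): a record-at-a-time engine hands each result onward immediately, while a micro-batcher holds every record until its batch closes, so each result waits for the rest of its batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustration: record-at-a-time vs. micro-batch emission.
public class ProcessingModels {
    // True streaming: handle and emit per record, immediately.
    public static void recordAtATime(List<String> input, Consumer<String> emit) {
        for (String record : input) {
            emit.accept(record); // result is visible right away
        }
    }

    // Micro-batching: nothing is emitted until a batch is full.
    public static List<List<String>> microBatch(List<String> input, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String record : input) {
            current.add(record); // the record waits in the open batch
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) batches.add(current); // flush the tail
        return batches;
    }
}
```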
  43. 43. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  44. 44. API • Choice of API is a subjective matter: skills, preference, … • Typical options: declarative, expressive API with operations like map(), filter(); imperative, lower-level API with callbacks like process(event); streaming SQL: STREAM SELECT … FROM … WHERE … • In the best case you get not just one, but all three • "Abstractions are great!" vs. "Abstractions considered harmful!"
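The declarative and imperative styles from this slide can be contrasted with java.util.stream as a stand-in (illustration only; the class, interface, and filter logic here are made up, not any streaming framework's API). The streaming-SQL style would express the same job roughly as SELECT … FROM words WHERE word <> 'noise'.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration: the same filter-and-transform job in two API styles.
public class ApiStyles {
    // Declarative: state *what* to compute with map()/filter().
    public static List<String> declarative(List<String> words) {
        return words.stream()
                .filter(w -> !w.equals("noise"))
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Imperative: a lower-level per-event callback, process(event);
    // the callback itself decides how to filter and transform.
    public interface Processor { void process(String event); }

    public static void forEachEvent(List<String> words, Processor p) {
        for (String w : words) {
            p.process(w);
        }
    }
}
```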
  45. 45. Technical Dimensions: State • Abstractions • Time Model • Windowing • Out-of-Order Data • Reprocessing • Scalability & Elasticity • Fault Tolerance • Security • Processing Model • API • Dev/Ops Lifecycle
  46. 46. Developer/Operations Lifecycle • What should your daily work look and feel like? "I like to do quick, iterative development" (modify/test/repeat); "I want to decouple team roadmaps and project schedules" • Big difference between the App Model and the Cluster Model: testing, packaging, deployment, monitoring, operations; "Do I need to know Java (app) or YARN (cluster) for this?"; "I want reactive processing in containers that run on Mesos!" • Rolling, no-downtime upgrades? • Integration with existing Ops infrastructure, tools, and processes?
  47. 47. Agenda • Recap: What is Stream Processing? • The Three Pillars of Stream Processing in Practice • Key Selection Criteria • Organizational/Non-Technical Dimensions • Technical Dimensions • Summary
  48. 48. Summary • What we covered is a good starting point • But, no free lunch! • Understand what you need, and weigh the criteria appropriately • Think end-to-end: idea, development, operations, troubleshooting • Think big-picture: future use cases, architecture, security, training, … • Do your own internal hackathons and proof-of-concepts • Do your own benchmarks • If in doubt: simplicity beats complexity (faster to learn, easier to understand, less likely to fail, …)
  49. 49. Q&A Session
  50. 50. Coming Up Next • Dec 15: Streaming in Practice: Putting Apache Kafka in Production, presented by Roger Hoover • https://www.confluent.io/apache-kafka-talk-series
