[DataSciCon] Divide, distribute and conquer stream v. batch

Data is flowing everywhere around us: from phones, credit cards, sensor-equipped buildings, vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and video, and so on.

http://www.datascicon.tech

Published in: Software

  1. 1. DIVIDE, DISTRIBUTE AND CONQUER: STREAM V. BATCH
  2. 2. Stream v. Batch
  3. 3. Who am I?
  4. 4. Solutions Architect Who am I?
  5. 5. Solutions Architect Developer Advocate Who am I?
  6. 6. Solutions Architect Developer Advocate @gamussa in internetz Who am I?
  7. 7. Solutions Architect Developer Advocate @gamussa in internetz Hey you, yes, you, go follow me on Twitter Who am I?
  8. 8. @gamussa @confluentinc @DataSciCon BATCH PROCESSING Data at rest
  9. 9. @gamussa @confluentinc @DataSciCon Data and Queries Origin and processing
  10. 10. @gamussa @confluentinc @DataSciCon
  11. 11. @gamussa @confluentinc @DataSciCon Data…
  12. 12. @gamussa @confluentinc @DataSciCon Data…
  13. 13. @gamussa @confluentinc @DataSciCon ✓ … inherently immutable Data… ✓ … time-based
  14. 14. @gamussa @confluentinc @DataSciCon CRUD -> CR
  15. 15. @gamussa @confluentinc @DataSciCon Processing is a query
  16. 16. @gamussa @confluentinc @DataSciCon Processing is a query Function on full data set
  17. 17. @gamussa @confluentinc @DataSciCon Processing is a query Function on full data set Projection
  18. 18. @gamussa @confluentinc @DataSciCon Processing is a query Function on full data set Projection Aggregations
  19. 19. @gamussa @confluentinc @DataSciCon Processing is a query Function on full data set Projection Aggregations Joins
  20. 20. @gamussa @confluentinc @DataSciCon Lambda architecture origins http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
  21. 21. @gamussa @confluentinc @DataSciCon https://mapr.com/developercentral/lambda-architecture/ Lambda Architecture
  22. 22. @gamussa @confluentinc @DataSciCon
  23. 23. @gamussa @confluentinc @DataSciCon TFW Trying to explain modern big data landscape
  24. 24. @gamussa @confluentinc @DataSciCon
  25. 25. @gamussa @confluentinc @DataSciCon STREAM PROCESSING Data in motion
  26. 26. @gamussa @confluentinc @DataSciCon Streaming Platform
  27. 27. @gamussa @confluentinc @DataSciCon Streaming Platform
  28. 28. @gamussa @confluentinc @DataSciCon
  29. 29. @gamussa @confluentinc @DataSciCon Interesting cases Before You Go
  30. 30. I FOUND YOUR LACK OF FAULT TOLERANCE DISTURBING
  31. 31. Data is too important to store on one computer
  32. 32. @gamussa @confluentinc @DataSciCon How to process «infinite» data?
  33. 33. @gamussa @confluentinc @DataSciCon Time model
  34. 34. @gamussa @confluentinc @DataSciCon Time model Different use cases time semantics
  35. 35. @gamussa @confluentinc @DataSciCon Time model Different use cases time semantics Majority of use cases require event-time semantics
  36. 36. @gamussa @confluentinc @DataSciCon Time model Different use cases time semantics Majority of use cases require event-time semantics Other use cases may require processing-time or special variants like ingestion-time
  37. 37. @gamussa @confluentinc @DataSciCon Time Model
  38. 38. @gamussa @confluentinc @DataSciCon Time Model
  39. 39. @gamussa @confluentinc @DataSciCon Time Model
  40. 40. @gamussa @confluentinc @DataSciCon Windowing [Figure: event-time windowing. Axes: processing-time, event-time. Input data, where colors represent different users' events (alice, bob, dave). Rectangles denote different event-time windows.]
  41. 41. @gamussa @confluentinc @DataSciCon https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  42. 42. @gamussa @confluentinc @DataSciCon Windowing Windowing is an operation that groups events. Most commonly needed: time windows, session windows. Examples: ✗ Real-time monitoring: 5-minute averages ✗ Reader behavior on a website: user browsing sessions
  43. 43. @gamussa @confluentinc @DataSciCon Out-of-order and late data Very common in practice, not a rare corner case ✗ Related to time model discussion
  44. 44. @gamussa @confluentinc @DataSciCon Out-of-order and late data
  45. 45. @gamussa @confluentinc @DataSciCon Out-of-order and late data Users with mobile phones board an airplane and lose Internet connectivity
  46. 46. @gamussa @confluentinc @DataSciCon Out-of-order and late data Users with mobile phones board an airplane and lose Internet connectivity Emails are written during the 10h flight
  47. 47. @gamussa @confluentinc @DataSciCon Out-of-order and late data Users with mobile phones board an airplane and lose Internet connectivity Emails are written during the 10h flight Internet connectivity is restored, and the phones now send the queued emails
  48. 48. @gamussa @confluentinc @DataSciCon Stream Processing: results
  49. 49. @gamussa @confluentinc @DataSciCon Stream Processing: results • Yes, it’s possible to get computation results in real time
  50. 50. @gamussa @confluentinc @DataSciCon Stream Processing: results • Yes, it’s possible to get computation results in real time • Windows – finite view of infinite data • Based on temporal characteristics of the event
  51. 51. @gamussa @confluentinc @DataSciCon Stream Processing: results • Yes, it’s possible to get computation results in real time • Windows – finite view of infinite data • Based on temporal characteristics of the event • Late event processing • You choose how long to wait
  52. 52. @gamussa @confluentinc @DataSciCon DEMO Let’s analyze flights
  53. 53. @gamussa @confluentinc @DataSciCon https://www.confluent.io/blog/predicting-flight-arrivals-with-the-apache-kafka-streams-api/
  54. 54. @gamussa @confluentinc @DataSciCon Example: Training Flight Prediction Model
  55. 55. @gamussa @confluentinc @DataSciCon https://github.com/confluentinc/online-inferencing-blog-application
  56. 56. @gamussa @confluentinc @DataSciCon Thanks! Questions? @gamussa viktor@confluent.io
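
The batch slides (13 to 19) describe data as immutable and time-based, and processing as a query: a function on the full data set, such as a projection, an aggregation, or a join. A minimal Python sketch of that idea follows. The `Event` record, the sample data, and the `USER_COUNTRY` lookup table are hypothetical and exist only for illustration; they do not come from the talk.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """An immutable, time-stamped fact (hypothetical record layout)."""
    user: str
    amount: float
    timestamp: int  # event time in seconds


# Append-only data set: events are created and read (CR), never updated or deleted.
EVENTS = (
    Event("alice", 10.0, 1),
    Event("bob", 5.0, 2),
    Event("alice", 2.5, 3),
)

# Hypothetical reference data used for the join.
USER_COUNTRY = {"alice": "US", "bob": "DE"}


def projection(events):
    """Projection: keep only the fields the query needs."""
    return [(e.user, e.amount) for e in events]


def total_per_user(events):
    """Aggregation: sum the amounts per user over the full data set."""
    totals = defaultdict(float)
    for e in events:
        totals[e.user] += e.amount
    return dict(totals)


def total_per_country(events, user_country):
    """Join: enrich each event with the user's country, then aggregate."""
    totals = defaultdict(float)
    for e in events:
        totals[user_country[e.user]] += e.amount
    return dict(totals)
```

Each function takes the whole data set as input and can be re-run at any time, because the underlying events never change.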

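
The streaming slides (33 to 51) describe event-time windows as a finite view of infinite data, and note that with late events "you choose how long to wait". The sketch below models that in plain Python. It is a simplified illustration, not the Kafka Streams API: the `TumblingWindowCounter` class, the `size` and `grace` parameters, and the sample events are all assumptions made for this example. "Stream time" here means the highest event time seen so far.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per user in fixed-size event-time windows (illustrative only)."""

    def __init__(self, size, grace):
        self.size = size            # window length in seconds
        self.grace = grace          # how long to wait for late events, in seconds
        self.stream_time = 0        # highest event time observed so far
        self.counts = defaultdict(int)  # (user, window_start) -> count
        self.dropped = []           # events that arrived after their window closed

    def window_start(self, event_time):
        """Align an event time to the start of its tumbling window."""
        return event_time - (event_time % self.size)

    def process(self, user, event_time):
        """Assign the event to its window, or drop it if the window has closed."""
        self.stream_time = max(self.stream_time, event_time)
        start = self.window_start(event_time)
        # A window closes once stream time passes its end plus the grace period.
        if start + self.size + self.grace <= self.stream_time:
            self.dropped.append((user, event_time))
            return False
        self.counts[(user, start)] += 1
        return True


# 5-minute windows, waiting 60 seconds for late events.
counter = TumblingWindowCounter(size=300, grace=60)
arrivals = [
    ("alice", 10),
    ("bob", 200),
    ("alice", 310),
    ("alice", 290),  # out of order, but still within the grace period
    ("bob", 400),
    ("alice", 120),  # too late: window [0, 300) closed at stream time 360
]
for user, event_time in arrivals:
    counter.process(user, event_time)
```

In this run the out-of-order event at time 290 is still counted in the first window, while the event at time 120 arrives after stream time has reached 400 and is recorded in `counter.dropped`. A longer `grace` value trades result latency for completeness.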