
Stream or segment: what is the best way to access your events in Pulsar? (Neng Lu)


Infinite event streams are the core data abstraction in Apache Pulsar. Pulsar provides two levels of read API for accessing events in Pulsar topics: the pub/sub API and the segment reader API. The pub/sub API is a unified messaging API for accessing events in a streaming fashion, and consumers can choose among different subscription modes. The segment API reads events directly from Apache BookKeeper and tiered storage, which makes it better suited to batch-oriented workloads. The two APIs can also be combined to create a unified data processing experience.
Over the past year, we at StreamNative have helped many customers run Pulsar for use cases ranging from online queuing and event sourcing to stream and batch processing. We have also worked on integrating Pulsar with other components of the big data ecosystem. In this talk, we share our experience and best practices for choosing the right API to access your event streams in Pulsar for different use cases.
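
The subscription modes of the pub/sub API differ mainly in how one subscription hands messages to its consumers. What follows is a toy, standard-library-only Python model of that dispatch, not the Pulsar client API. The `dispatch` function, the consumer names, and the CRC32-modulo key routing are all assumptions made for illustration; Pulsar's actual consumer selection and Key_Shared routing are more involved.

```python
from collections import defaultdict
from zlib import crc32


def dispatch(messages, consumers, mode):
    """Toy model of one subscription handing messages to its consumers.

    messages:  list of (key, payload) tuples in publish order.
    consumers: list of consumer names attached to the subscription.
    Returns {consumer_name: [payload, ...]}.
    """
    if mode == "Exclusive" and len(consumers) != 1:
        # Exclusive allows a single consumer on the subscription.
        raise ValueError("Exclusive subscription accepts exactly one consumer")

    received = defaultdict(list)
    for i, (key, payload) in enumerate(messages):
        if mode in ("Exclusive", "Failover"):
            # One active consumer receives everything, in order.
            # Assumption: the first consumer in the list is the active one.
            target = consumers[0]
        elif mode == "Shared":
            # Round-robin across consumers; per-key order is not preserved.
            target = consumers[i % len(consumers)]
        elif mode == "Key_Shared":
            # Same key always lands on the same consumer.
            # Assumption: plain hash-modulo routing, for illustration only.
            target = consumers[crc32(key.encode("utf-8")) % len(consumers)]
        else:
            raise ValueError(f"unknown subscription mode: {mode}")
        received[target].append(payload)
    return dict(received)
```

In this model, Exclusive and Failover keep a total order on one consumer, Shared trades ordering for throughput, and Key_Shared keeps ordering per key while still spreading load.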

Published in: Data & Analytics

  1. 1. streamnative.io Stream/Segment - best way to access events in Pulsar Neng Lu
  2. 2. Who Am I ❏ StreamNative Software Engineer ❏ Ex-Twitter ❏ Contributed to Apache Projects - Heron, Pulsar ❏ Interested in event streaming technologies
  3. 3. Pulsar 1.X
  4. 4. Apache Pulsar “Flexible Pub/Sub Messaging Backed by Durable Log Storage”
  5. 5. Pulsar 2.X
  6. 6. Apache Pulsar “Cloud-native Messaging and Event Streaming Platform”
  7. 7. Pulsar Use Cases ❏ Unified Event Center/Bus (Queuing + Streaming) ❏ Billing Service ❏ Push Notification ❏ Worker Queue ❏ Logging Pipeline ❏ IoT ❏ Streaming-first, unified data processing
  8. 8. Data Processing with Apache Pulsar
  9. 9. Data Processing Categories ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance
  10. 10. Data Processing Categories ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance ❏ Streaming ❏ Long running jobs ❏ Time critical ❏ Scalability as well as fault tolerance
  11. 11. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance ❏ Streaming ❏ Long running jobs ❏ Time critical ❏ Scalability as well as fault tolerance
  12. 12. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance ❏ Streaming ❏ Long running jobs ❏ Time critical ❏ Scalability as well as fault tolerance ❏ Serverless ❏ Simple, light-weight processing ❏ Processing data with high velocity
  13. 13. Apache Pulsar Layered Architecture Stateless Serving Durable Storage
  14. 14. Pulsar Messaging API ❏ Read data from brokers with different Subscription Modes ❏ Consume / Seek / Receive ❏ Reprocessing data by rewinding (seeking) the cursors
  15. 15. Subscription Mode ❏ Exclusive ❏ Failover ❏ Shared ❏ Key_Shared
  16. 16. Pulsar Segment API ❏ Read data from storage (BookKeeper or tiered storage) ❏ Fine-grained parallelism ❏ Predicate pushdown (publish timestamp)
  17. 17. Segment Centric Storage ❏ Topic Partition (Managed Ledger) ❏ The storage layer for a single topic partition ❏ Segment (Ledger) ❏ Single writer, append-only ❏ Replicated to multiple bookies
  18. 18. Tiered Storage ❏ Long retention ❏ Low cost ❏ Easy to access
  19. 19. Apache Pulsar Data APIs [diagram: Producer, Consumer, Broker 1-3, Bookie 1-5, tiered storage (Hadoop, GCS, S3); Messaging API, Segment API]
  20. 20. Pulsar - Infinite Event Stream Storage
  21. 21. Pulsar - Topic
  22. 22. Pulsar - Topic Partitions
  23. 23. Pulsar - Segments
  24. 24. Pulsar - Stream
  25. 25. Pulsar - Infinite Event Stream Storage
  26. 26. Benefits ❏ Unlimited Topic Partition Storage ❏ Instant Scaling without Data Rebalancing ❏ Broker Failure Recovery ❏ Bookie Failure Recovery ❏ Cluster Expansion ❏ Low latency reading for messaging data ❏ High throughput reading for batch data ❏ Reduced cost for whole data storage
  27. 27. Pulsar SQL Case
  28. 28. Pulsar Flink Case [diagram: Flink Job 1, Flink Job 2]
  29. 29. Conclusion ❏ Apache Pulsar is a cloud-native messaging and streaming system ❏ Multi-layered architecture ❏ Segment-centric storage ❏ Two levels of reading API: Pub/Sub + Segment ❏ Apache Pulsar provides a unified view of data
  30. 30. Community ❏ Pulsar Website: https://pulsar.apache.org ❏ Twitter: @apache_pulsar / @streamnativeio ❏ Slack: https://apache-pulsar.herokuapp.com ❏ Mailing Lists dev@pulsar.apache.org , users@pulsar.apache.org ❏ Github: https://github.com/apache/pulsar ❏ Medium: https://medium.com/streamnative
  31. 31. Thank You!
  32. 32. Reference ❏ http://pulsar.apache.org/docs/en/concepts-overview/ ❏ https://www.splunk.com/en_us/blog/it/comparing-pulsar-and-kafka-how-a-segment-based-architecture-delivers-better-performance-scalability-and-resilience.html#:~:text=Pulsar%20Architecture%20Basics&text=An%20Apache%20Pulsar%20cluster%20is,bookies%20that%20durably%20store%20messages. ❏ https://pulsar.apache.org/docs/en/concepts-tiered-storage/ ❏ https://flink.apache.org/2019/05/03/pulsar-flink.html
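
The segment API points on slides 16 and 17 (reading from storage, fine-grained parallelism, predicate pushdown on publish timestamp) can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `Segment`, `read_segment`, and `segment_scan` are hypothetical names, segments are plain Python lists rather than BookKeeper ledgers, and a thread pool stands in for the parallel readers of a batch engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for one ledger: (publish_ts, payload) entries."""
    ledger_id: int
    entries: List[Tuple[int, str]]

    @property
    def min_ts(self) -> int:
        return min(ts for ts, _ in self.entries)

    @property
    def max_ts(self) -> int:
        return max(ts for ts, _ in self.entries)


def read_segment(segment: Segment, start_ts: int, end_ts: int) -> List[str]:
    """Read one segment, keeping entries published within [start_ts, end_ts]."""
    return [p for ts, p in segment.entries if start_ts <= ts <= end_ts]


def segment_scan(segments: List[Segment], start_ts: int, end_ts: int, workers: int = 4):
    """Return (ledger ids actually read, matching payloads in segment order)."""
    # Predicate pushdown: skip whole segments whose publish-time range
    # cannot overlap the requested window, so they are never read.
    candidates = [
        s for s in segments
        if s.entries and s.max_ts >= start_ts and s.min_ts <= end_ts
    ]
    # Fine-grained parallelism: each remaining segment is read independently.
    # Executor.map yields results in input order, so segment order is kept.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(lambda s: read_segment(s, start_ts, end_ts), candidates))
    return [s.ledger_id for s in candidates], [p for chunk in chunks for p in chunk]
```

For example, with three segments covering timestamps 10-20, 30-40, and 50-60, a scan for the window 35-55 touches only the second and third segments. Because each segment is sealed and append-only, readers in this model need no coordination with one another.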
