When Apache Pulsar meets Apache Flink

Apache Pulsar and Apache Flink share a similar view of how the data layer and the computation layer of an application can be “streaming-first”, with batch as a special case of streaming. With Apache Pulsar’s segmented-stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and to build a real streaming warehouse.

In this talk, Sijie Guo from the Apache Pulsar community will give an overview of Apache Pulsar and how it provides a unified data view that fully leverages Apache Flink’s unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.

Published in: Technology

  1. 1. When Apache Pulsar meets Apache Flink Sijie Guo (@sijieg)
  2. 2. Who am I ❏ Apache Pulsar PMC Member ❏ Apache BookKeeper PMC Chair ❏ StreamNative Founder ❏ Ex-Twitter, Ex-Yahoo ❏ Interested in event streaming technologies
  3. 3. What is Apache Pulsar?
  4. 4. “Flexible Pub/Sub Messaging Backed by durable log storage”
  5. 5. A brief history of Apache Pulsar ❏ 2012: Pulsar idea started ❏ 5+ years in production, 100+ applications, 10+ data centers ❏ 2016/09 Yahoo open sourced Pulsar ❏ 2017/06 Yahoo donated Pulsar to ASF ❏ 2018/09 Pulsar graduated as a Top-Level project ❏ 25+ committers, 154 contributors, 900+ forks, 4000+ stars ❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, ...
  6. 6. Pulsar 1.x
  7. 7. Pulsar Use Cases ❏ Unified Event Center/Bus (Queuing + Streaming) ❏ Billing Service ❏ Push Notification ❏ Worker Queue ❏ Logging Pipeline ❏ IoT ❏ Streaming-first, unified data processing ❏ ...
  8. 8. Pulsar 2.x
  9. 9. Pulsar Use Cases ❏ Unified Event Center/Bus (Queuing + Streaming) ❏ Billing Service ❏ Push Notification ❏ Worker Queue ❏ Logging Pipeline ❏ IoT ❏ Streaming-first, unified data processing ❏ ...
  10. 10. Data Processing with Apache Pulsar
  11. 11. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures
  12. 12. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance
  13. 13. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance ❏ Streaming ❏ Long running jobs ❏ Time critical ❏ Need scalability as well as resilience to failures
  14. 14. Data Processing Categories ❏ Interactive ❏ Time critical ❏ Medium data size ❏ Rerun on failures ❏ Batch ❏ The amount of data is huge ❏ Can run on a huge cluster ❏ Fine-grained fault tolerance ❏ Streaming ❏ Long running jobs ❏ Time critical ❏ Need scalability as well as resilience to failures ❏ Serverless ❏ Simple, light-weight processing ❏ Processing data with high velocity
  15. 15. Streaming-First Batch processing is a special case of stream processing A Flink view on computing
  16. 16. Infinite segmented streams (pub/sub + segment) A Pulsar view on data
  17. 17. Pulsar + Flink = Streaming-first, unified data processing
  18. 18. Why Pulsar fits well in Flink
  19. 19. Pulsar - A cloud-native architecture Stateless Serving Durable Storage
  20. 20. Pulsar - Segment Centric Storage ❏ Topic Partition (Managed Ledger) ❏ The storage layer for a single topic partition ❏ Segment (Ledger) ❏ Single writer, append-only ❏ Replicated to multiple bookies
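
The segment-centric layout on this slide can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `SegmentedLogSketch` and its methods are hypothetical names, not Pulsar or BookKeeper APIs, segments roll over after a fixed number of entries, and replication to multiple bookies is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one topic partition stored as a chain of segments.
// None of these names come from Pulsar or BookKeeper.
class SegmentedLogSketch {
    // One segment: append-only, sealed once a newer segment replaces it.
    static final class Segment {
        final long id;
        final List<String> entries = new ArrayList<>();
        boolean sealed = false;

        Segment(long id) {
            this.id = id;
        }
    }

    private final int maxEntriesPerSegment; // assumed rollover rule
    private final List<Segment> segments = new ArrayList<>();

    SegmentedLogSketch(int maxEntriesPerSegment) {
        this.maxEntriesPerSegment = maxEntriesPerSegment;
        segments.add(new Segment(0));
    }

    // Appends always go to the last (open) segment; returns {segmentId, entryId}.
    long[] append(String payload) {
        Segment current = segments.get(segments.size() - 1);
        if (current.entries.size() >= maxEntriesPerSegment) {
            current.sealed = true;                 // seal the full segment
            current = new Segment(current.id + 1); // roll over to a new one
            segments.add(current);
        }
        current.entries.add(payload);
        return new long[] {current.id, current.entries.size() - 1};
    }

    int segmentCount() {
        return segments.size();
    }

    boolean isSealed(int segmentIndex) {
        return segments.get(segmentIndex).sealed;
    }

    String read(int segmentIndex, int entryId) {
        return segments.get(segmentIndex).entries.get(entryId);
    }

    public static void main(String[] args) {
        SegmentedLogSketch log = new SegmentedLogSketch(2);
        for (int i = 0; i < 5; i++) {
            long[] position = log.append("msg-" + i);
            System.out.println("msg-" + i + " -> segment " + position[0] + ", entry " + position[1]);
        }
    }
}
```

Because only the last segment accepts writes, sealed segments are immutable and can be read, or moved to cheaper storage, independently of the writer.
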
  21. 21. Pulsar - Infinite stream storage
  22. 22. Pulsar - Pub/Sub
  23. 23. Pulsar - Topic Partitions
  24. 24. Pulsar - Segments
  25. 25. Pulsar - Stream
  26. 26. Pulsar - Stream as a unified view on data
  27. 27. Pulsar - Two levels of reading API ❏ Pub/Sub (Streaming) ❏ Read data from brokers ❏ Consume / Seek / Receive ❏ Subscription Mode - Failover, Shared, Key_Shared ❏ Reprocessing data by rewinding (seeking) the cursors ❏ Segment (Batch) ❏ Read data from storage (bookkeeper or tiered storage) ❏ Fine-grained Parallelism ❏ Predicate pushdown (publish timestamp)
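
To make the segment-level (batch) read path more concrete, here is a minimal sketch of predicate pushdown on publish timestamp. It assumes each segment carries the minimum and maximum publish time of its entries, which is an assumption made for illustration. `SegmentReadSketch`, `SegmentMeta`, and `selectSegments` are hypothetical names rather than Pulsar APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: prune segments by publish time before reading them.
class SegmentReadSketch {
    // Assumed per-segment metadata: the publish-time range the segment covers.
    static final class SegmentMeta {
        final long id;
        final long minPublishTime;
        final long maxPublishTime;

        SegmentMeta(long id, long minPublishTime, long maxPublishTime) {
            this.id = id;
            this.minPublishTime = minPublishTime;
            this.maxPublishTime = maxPublishTime;
        }
    }

    // Keep only the segments whose time range overlaps [from, to] (inclusive).
    static List<Long> selectSegments(List<SegmentMeta> all, long from, long to) {
        List<Long> selected = new ArrayList<>();
        for (SegmentMeta segment : all) {
            if (segment.maxPublishTime >= from && segment.minPublishTime <= to) {
                selected.add(segment.id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SegmentMeta> segments = new ArrayList<>();
        segments.add(new SegmentMeta(0, 0, 99));
        segments.add(new SegmentMeta(1, 100, 199));
        segments.add(new SegmentMeta(2, 200, 299));
        System.out.println(selectSegments(segments, 150, 250));
    }
}
```

Each selected segment can then be handed to a separate reader, which is one way to obtain the fine-grained parallelism listed on the slide.
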
  28. 28. Unified data processing on Pulsar
  29. 29. When Pulsar Meets Flink
  30. 30. Flink Integration ❏ Available Connectors ❏ Streaming Source ❏ Streaming Sink ❏ Table Sink ❏ Flink 1.6.0 When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html
  31. 31. Flink 1.9 Integration ❏ Pulsar Schema Integration ❏ Table API as first-class citizens ❏ Exactly-once source ❏ At-least-once sink
  32. 32. Pulsar Schema (1) ❏ Consensus of data at server-side ❏ Built-in schema registry ❏ Data schema on a per-topic basis ❏ Send and receive typed messages directly ❏ Validation ❏ Multi-version ❏ Schema evolution & compatibilities
  33. 33. Pulsar Schema (2)
        // Create producer with Struct schema and send messages
        Producer<User> producer = client.newProducer(Schema.AVRO(User.class)).create();
        producer.newMessage()
            .value(User.builder()
                .userName("pulsar-user")
                .userId(1L)
                .build())
            .send();
        // Create consumer with Struct schema and receive messages
        Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class)).create();
        consumer.receive();
  34. 34. Pulsar Schema (3) - SchemaInfo
        {
          "type": "JSON",
          "schema": "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.foo\",\"fields\":[{\"name\":\"file1\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"file2\",\"type\":\"string\",\"default\":null},{\"name\":\"file3\",\"type\":[\"null\",\"string\"],\"default\":\"dfdf\"}]}",
          "properties": {}
        }
  35. 35. Pulsar Schema (4) - Producer
  36. 36. Pulsar Schema (5) - Consumer
  37. 37. Pulsar Schema (6) - Compatibility Strategy
  38. 38. Pulsar Schema (7) - Multi versions
  39. 39. Pulsar-Flink (1) - Schema <-> Row https://github.com/streamnative/pulsar-flink ❏ Topics without schema or with primitive schemas ❏ `value` field for message payload ❏ Topics with struct schemas (AVRO, JSON) ❏ Field names and types are kept in the row ❏ Metadata Fields ❏ __key: Binary ❏ __topic: String ❏ __messageId: Binary ❏ __publishTime: Timestamp ❏ __eventTime: Timestamp
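
The mapping described on this slide can be sketched in plain Java as follows. This is an illustration, not the connector's code: `RowMappingSketch`, `Msg`, and the two factory methods are hypothetical, and a row is modelled as an ordered map. Only the column names (`value` and the five `__` metadata fields) come from the slide.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: Msg and the method names are hypothetical, not connector classes.
class RowMappingSketch {
    // Stand-in for a received message and its metadata.
    static final class Msg {
        final byte[] key;
        final String topic;
        final byte[] messageId;
        final long publishTimeMillis;
        final long eventTimeMillis;

        Msg(byte[] key, String topic, byte[] messageId, long publishTimeMillis, long eventTimeMillis) {
            this.key = key;
            this.topic = topic;
            this.messageId = messageId;
            this.publishTimeMillis = publishTimeMillis;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    // Topic without a schema or with a primitive schema: the payload goes into `value`.
    static Map<String, Object> rowFromPrimitive(Object payload, Msg msg) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("value", payload);
        return withMetadata(row, msg);
    }

    // Topic with a struct schema (AVRO, JSON): field names are kept as columns.
    static Map<String, Object> rowFromStruct(Map<String, Object> fields, Msg msg) {
        return withMetadata(new LinkedHashMap<>(fields), msg);
    }

    // Append the metadata columns listed on the slide, in the same order.
    private static Map<String, Object> withMetadata(Map<String, Object> row, Msg msg) {
        row.put("__key", msg.key);
        row.put("__topic", msg.topic);
        row.put("__messageId", msg.messageId);
        row.put("__publishTime", Instant.ofEpochMilli(msg.publishTimeMillis));
        row.put("__eventTime", Instant.ofEpochMilli(msg.eventTimeMillis));
        return row;
    }

    public static void main(String[] args) {
        Msg msg = new Msg(new byte[] {1}, "persistent://public/default/users", new byte[] {2}, 1000L, 2000L);
        System.out.println(rowFromPrimitive("hello", msg).keySet());
    }
}
```
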
  40. 40. Pulsar-Flink (2) - Schema Examples Primitive Schema Avro Schema https://github.com/streamnative/pulsar-flink
  41. 41. Pulsar-Flink (3) - Pulsar Source
  42. 42. Pulsar-Flink (4) - Streaming Tables
  43. 43. Pulsar-Flink (5) - Topic Partitions Discovery ❏ Find matching topics ❏ Fetch schemas for each topic ❏ Build schema-specific deserializer ❏ Each reader is responsible for one topic partition ❏ Each source task has a partition discovery task to check for newly added partitions
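
A rough sketch of the per-task discovery loop described above follows. The slide does not say how partitions are distributed across source tasks, so the hash-modulo ownership rule here is purely an assumption, and `PartitionDiscoverySketch` and `discover` are hypothetical names. A real task would obtain the current partition list from the Pulsar admin API instead of receiving it as an argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of one source task's partition discovery.
class PartitionDiscoverySketch {
    private final int taskIndex;
    private final int parallelism;
    private final Set<String> known = new HashSet<>();

    PartitionDiscoverySketch(int taskIndex, int parallelism) {
        this.taskIndex = taskIndex;
        this.parallelism = parallelism;
    }

    // Returns partitions that are new since the previous call and owned by this task.
    List<String> discover(List<String> currentPartitions) {
        List<String> added = new ArrayList<>();
        for (String partition : currentPartitions) {
            // Assumed ownership rule: hash of the partition name modulo parallelism.
            boolean mine = Math.floorMod(partition.hashCode(), parallelism) == taskIndex;
            if (mine && known.add(partition)) {
                added.add(partition); // a new reader would be started for it here
            }
        }
        return added;
    }

    public static void main(String[] args) {
        PartitionDiscoverySketch task = new PartitionDiscoverySketch(0, 1);
        System.out.println(task.discover(Arrays.asList("topic-partition-0", "topic-partition-1")));
        System.out.println(task.discover(
            Arrays.asList("topic-partition-0", "topic-partition-1", "topic-partition-2")));
    }
}
```
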
  44. 44. Pulsar-Flink (6) Exactly-once Source ❏ Message order on partition basis ❏ Seek & read ❏ Checkpoints with MessageID ❏ Durable cursor to keep un-checkpointed messages alive ❏ Move cursor when a checkpoint is completed
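
The interplay of checkpoints, seeking, and the durable cursor can be sketched with a small simulation. Everything here is a simplification made for illustration: `ExactlyOnceSourceSketch` is a hypothetical class, a list index stands in for a Pulsar MessageID, and the durable cursor is reduced to a single integer marking the oldest message that must be retained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a checkpointed source reading one topic partition.
class ExactlyOnceSourceSketch {
    private final List<String> partition;  // stand-in for one topic partition
    private int nextToRead = 0;            // reader position (a MessageID in Pulsar)
    private int durableCursor = 0;         // messages before this may be dropped
    private final Map<Long, Integer> pendingCheckpoints = new HashMap<>();

    ExactlyOnceSourceSketch(List<String> partition) {
        this.partition = partition;
    }

    // Read the next message in partition order, or null if none is left.
    String read() {
        return nextToRead < partition.size() ? partition.get(nextToRead++) : null;
    }

    // Snapshot: record the read position as part of the checkpoint state.
    int snapshotState(long checkpointId) {
        pendingCheckpoints.put(checkpointId, nextToRead);
        return nextToRead;
    }

    // Only a completed checkpoint lets the durable cursor move forward.
    void notifyCheckpointComplete(long checkpointId) {
        Integer position = pendingCheckpoints.remove(checkpointId);
        if (position != null && position > durableCursor) {
            durableCursor = position;
        }
    }

    // Recovery: seek back to the checkpointed position and read again from there.
    void restore(int checkpointedPosition) {
        if (checkpointedPosition < durableCursor) {
            throw new IllegalStateException("messages before the durable cursor may be gone");
        }
        nextToRead = checkpointedPosition;
    }

    int durableCursor() {
        return durableCursor;
    }

    public static void main(String[] args) {
        ExactlyOnceSourceSketch source = new ExactlyOnceSourceSketch(Arrays.asList("a", "b", "c", "d"));
        source.read();
        source.read();
        int state = source.snapshotState(1L);
        source.notifyCheckpointComplete(1L);
        source.read();            // read after the checkpoint, then "fail"
        source.restore(state);    // seek back to the checkpointed position
        System.out.println("replayed after restore: " + source.read());
    }
}
```

Since downstream state is rolled back to the same checkpoint, replaying from the checkpointed position neither loses nor double-counts messages in that state, and the cursor keeps those messages available until the checkpoint completes.
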
  45. 45. Pulsar-Flink (7) - Pulsar Sink
  46. 46. Pulsar-Flink (8) - Write to streaming tables
  47. 47. Future directions ❏ Unified Source API for both batch and streaming execution ❏ FLIP-27 ❏ Pulsar as a catalog ❏ Pulsar as a state backend ❏ Scale-out source parallelism ❏ Key_Shared & Sticky consumer ❏ End-to-end exactly-once ❏ Pulsar transaction in 2.5.0
  48. 48. Key_Shared Subscription
  49. 49. Key_Shared Subscription ❏ Key based ordering ❏ Key can be message key or a separate *order* key ❏ HashRing based routing ❏ Key based batcher ❏ Policies for messages without *keys* https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
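
Hash-ring routing of keys to consumers can be sketched as below. This is a generic consistent-hashing illustration, not the broker's implementation: `KeySharedRingSketch` is a hypothetical class, the number of ring points per consumer is an arbitrary choice, and `String.hashCode()` is used only to keep the example dependency-free (a production system would use a stronger hash function).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consistent-hash ring that routes message keys to consumers.
class KeySharedRingSketch {
    private static final int POINTS_PER_CONSUMER = 16; // arbitrary choice for the sketch
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Place several points for the consumer on the ring.
    void addConsumer(String consumer) {
        for (int i = 0; i < POINTS_PER_CONSUMER; i++) {
            ring.put((consumer + "#" + i).hashCode(), consumer);
        }
    }

    // Remove every ring point owned by the consumer.
    void removeConsumer(String consumer) {
        ring.values().removeIf(consumer::equals);
    }

    // Route a key to the first ring point at or after its hash, wrapping around.
    String route(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(key.hashCode());
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        KeySharedRingSketch ring = new KeySharedRingSketch();
        ring.addConsumer("consumer-1");
        ring.addConsumer("consumer-2");
        // The same key is always routed to the same consumer, which preserves per-key order.
        System.out.println(ring.route("order-42").equals(ring.route("order-42")));
    }
}
```

With this layout, removing a consumer only reassigns the keys that were routed to it, so ordering for all other keys is undisturbed.
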
  50. 50. Conclusion ❏ Apache Pulsar is a cloud-native messaging and streaming system ❏ Multi-layered architecture ❏ Segment-centric storage ❏ Two levels of reading API: Pub/Sub + Segment ❏ Apache Pulsar provides a unified view of data ❏ Apache Flink provides a unified view of computing ❏ Pulsar + Flink for streaming-first, unified data processing
  51. 51. Unified Data Processing
  52. 52. Community ❏ Pulsar Website: https://pulsar.apache.org ❏ Twitter: @apache_pulsar / @streamnativeio ❏ Slack: https://apache-pulsar.herokuapp.com ❏ Mailing Lists dev@pulsar.apache.org, users@pulsar.apache.org ❏ Github https://github.com/apache/pulsar ❏ Medium https://medium.com/streamnative
  53. 53. Thanks!
