Query Pulsar Streams using Apache Flink

Apache Pulsar and Apache Flink share a similar view of how the data and computation levels of an application can be “streaming-first”, with batch as a special case of streaming. With Apache Pulsar’s segmented-stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and to build a real streaming warehouse.

In this talk, Sijie Guo from the Apache Pulsar community shares the latest integrations between Apache Pulsar and Apache Flink. He explains how Apache Flink can integrate with and leverage Pulsar’s built-in, efficient schemas so that Flink SQL users can query Pulsar streams in real time.

Published in: Internet


  1. Query Pulsar Streams using Apache Flink
     Sijie Guo (@sijieg)
  2. Who am I
     ❏ Apache Pulsar PMC Member
     ❏ Apache BookKeeper PMC Chair
     ❏ StreamNative Founder
     ❏ Ex-Twitter, Ex-Yahoo
     ❏ Interested in event streaming technologies
  3. What is Apache Pulsar?
  4. “Flexible Pub/Sub Messaging Backed by durable log storage”
  5. Highlights
     ❏ Multi-Tenant Data System: Isolation, ACL, Policies
     ❏ Unified messaging model: Queuing + Streaming
     ❏ Infinite Segmented Stream Storage: Segment-centric, Tiered storage
     ❏ Structured Event Streams: Built-in schema management
     ❏ Cloud-Native Architecture: Simplified ops, Rebalance-free
  6. A brief history of Apache Pulsar
     ❏ 2012: Pulsar idea started at Yahoo!
     ❏ 5+ years in production, 100+ applications, 10+ data centers
     ❏ 2016/09: Yahoo open sourced Pulsar
     ❏ 2017/06: Yahoo donated Pulsar to the ASF
     ❏ 2018/09: Pulsar graduated as a Top-Level Project
     ❏ 25+ committers, 168 contributors, 1000+ forks, 4200+ stars
     ❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, THG, OVH, …
       http://pulsar.apache.org/en/powered-by/
  7. Pulsar Use Cases
     ❏ Billing / Payment / Trading Service
     ❏ Worker Queue / Push Notifications / Task Queue
     ❏ Unified Messaging Backbone (Queuing + Streaming)
     ❏ IoT
     ❏ Unified Data Processing
  8. Pulsar at Tencent
     ❏ Billing Service (30+ billion)
     ❏ 500K QPS, 10 billion transaction requests
     ❏ 600+ Topics
  9. Pulsar Use Cases
     ❏ Billing / Payment / Trade Service
     ❏ Worker Queue / Push Notifications / Task Queue
     ❏ Unified Messaging Backbone (Queuing + Streaming)
     ❏ IoT
     ❏ Unified Data Processing
  10. Pulsar Use Cases
      ❏ Billing / Payment / Trade Service
      ❏ Worker Queue / Push Notifications / Task Queue
      ❏ Unified Messaging Backbone (Queuing + Streaming)
      ❏ IoT
      ❏ Unified Data Processing with Flink
  11. A Pulsar view on data
      Infinite segmented streams (pub/sub + segment)
  12. Pulsar - Pub/Sub
  13. Pulsar - Topic Partitions
  14. Pulsar - Segments
  15. Pulsar - Stream
  16. Pulsar - Infinite stream storage
  17. Pulsar - Stream as a unified view on data
  18. Pulsar - Two levels of reading API
      ❏ Pub/Sub (Streaming)
        ❏ Read data from brokers
        ❏ Consume / Seek / Receive
        ❏ Subscription Mode - Failover, Shared, Key_Shared
        ❏ Reprocessing data by rewinding (seeking) the cursors
      ❏ Segment (Batch)
        ❏ Read data from storage (BookKeeper or tiered storage)
        ❏ Fine-grained Parallelism
        ❏ Predicate pushdown (publish timestamp)
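The segment-level (batch) read path on slide 18 can be illustrated with a toy model. The sketch below does not use Pulsar's API: `SegmentPruner`, `Segment`, and `selectSegments` are hypothetical names, and it assumes each segment records the minimum and maximum publish timestamp of the messages it holds. Under that assumption, a publish-time predicate can be pushed down so that whole segments are skipped before any data is read, and the remaining segments can be handed to parallel readers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical and are not part of Pulsar.
final class SegmentPruner {

    // A stream segment with the publish-time range (epoch millis) it covers.
    static final class Segment {
        final long id;
        final long minPublishTime;
        final long maxPublishTime;

        Segment(long id, long minPublishTime, long maxPublishTime) {
            this.id = id;
            this.minPublishTime = minPublishTime;
            this.maxPublishTime = maxPublishTime;
        }
    }

    // Keep only the segments whose publish-time range overlaps [from, to].
    static List<Segment> selectSegments(List<Segment> segments, long from, long to) {
        List<Segment> selected = new ArrayList<>();
        for (Segment segment : segments) {
            boolean overlaps = segment.maxPublishTime >= from && segment.minPublishTime <= to;
            if (overlaps) {
                selected.add(segment);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(0, 0, 99),
                new Segment(1, 100, 199),
                new Segment(2, 200, 299));
        // Each selected segment could be assigned to its own reader task.
        for (Segment segment : selectSegments(segments, 150, 250)) {
            System.out.println("read segment " + segment.id);
        }
    }
}
```

Pruning on segment metadata rather than on individual messages is what makes the filter cheap: the decision is made once per segment, independent of how many messages the segment contains.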
  19. Unified data processing on Pulsar
  20. Unified Data Processing
  21. Flink 1.6 Integration
      ❏ Available Connectors
        ❏ Streaming Source
        ❏ Streaming Sink
        ❏ Table Sink
      When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html
  22. Flink Source
  23. But that’s not cool ...
  24. Flink Source
  25. Flink 1.9 Integration
      ❏ Pulsar Schema Integration
      ❏ Table API as first-class citizens
      ❏ Exactly-once source
      ❏ At-least-once sink
      ❏ Flink Catalog Integration
  26. Demo - Pulsar Catalog
  27. Pulsar Schema (1)
      ❏ Consensus of data at server-side
      ❏ Built-in schema registry
      ❏ Data schema on a per-topic basis
      ❏ Send and receive typed messages directly
      ❏ Validation
      ❏ Multi-version
      ❏ Schema evolution & compatibilities
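  28. Pulsar Schema (2)

      ```java
      import org.apache.pulsar.client.api.Consumer;
      import org.apache.pulsar.client.api.Producer;
      import org.apache.pulsar.client.api.Schema;

      // `client` is an existing PulsarClient and `User` is a POJO with a builder.
      // The topic and subscription names below are placeholders.
      // Create a producer with a struct schema and send a message
      Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
          .topic("user-topic")
          .create();
      producer.newMessage()
          .value(User.builder()
              .userName("pulsar-user")
              .userId(1L)
              .build())
          .send();
      // Create a consumer with the same struct schema and receive a message
      Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
          .topic("user-topic")
          .subscriptionName("user-subscription")
          .subscribe();
      User user = consumer.receive().getValue();
      ```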
  29. Pulsar Schema (3) - SchemaInfo

      ```
      {
        "type": "JSON",
        "schema": "{
          "type":"record",
          "name":"User",
          "namespace":"com.foo",
          "fields":[
            { "name":"file1", "type":["null","string"], "default":null },
            { "name":"file2", "type":"string", "default":null },
            { "name":"file3", "type":["null","string"], "default":"dfdf" }
          ]
        }",
        "properties": {}
      }
      ```
  30. Pulsar Schema (4) - Producer
  31. Pulsar Schema (5) - Consumer
  32. Pulsar Schema (6) - Compatibility Strategy
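The slide on compatibility strategies carries no text in this transcript, so the following is only a rough illustration of one such strategy. It is a minimal sketch of an Avro-style backward-compatibility rule: a reader using the new schema can decode data written with the old schema only if every field the new schema adds declares a default. The class and method names are hypothetical, and the check deliberately ignores field types, type promotion, and aliases, which a real compatibility checker also considers.

```java
import java.util.Map;
import java.util.Set;

// Toy check only: hypothetical names, not Pulsar's or Avro's implementation.
final class BackwardCompatibility {

    // newFields maps each field name in the new (reader) schema to whether it
    // declares a default value; oldFieldNames lists the old (writer) fields.
    static boolean canReadOldData(Map<String, Boolean> newFields, Set<String> oldFieldNames) {
        for (Map.Entry<String, Boolean> field : newFields.entrySet()) {
            boolean presentInOld = oldFieldNames.contains(field.getKey());
            boolean hasDefault = field.getValue();
            // A field unknown to the writer can only be filled from a default.
            if (!presentInOld && !hasDefault) {
                return false;
            }
        }
        return true;
    }
}
```

For example, adding an `email` field with a default to a schema that previously had only `userName` and `userId` passes this check, while adding the same field without a default does not.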
  33. Pulsar Schema (7) - Multi versions
  34. Pulsar-Flink (1) - Schema <-> Row
      https://github.com/streamnative/pulsar-flink
      ❏ Topics without schema or with primitive schemas
        ❏ `value` field for message payload
      ❏ Topics with struct schemas (AVRO, JSON)
        ❏ Field names and types are kept in the row
      ❏ Metadata Fields
        ❏ __key: Binary
        ❏ __topic: String
        ❏ __messageId: Binary
        ❏ __publishTime: Timestamp
        ❏ __eventTime: Timestamp
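The schema-to-row mapping on slide 34 can be sketched as follows. This is an illustration only, not the connector's code: `RowMapping` and `toRow` are hypothetical names, a plain ordered map stands in for a Flink row, and the column order (payload fields first, metadata fields last) is an assumption. For a topic without a schema or with a primitive schema, the caller would pass a single-entry map with the key `value`.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: hypothetical names, not the pulsar-flink connector's API.
final class RowMapping {

    // Build a row from the payload fields of a message plus the metadata
    // fields listed on the slide. Timestamps are given in epoch millis.
    static Map<String, Object> toRow(Map<String, Object> payloadFields,
                                     byte[] key,
                                     String topic,
                                     byte[] messageId,
                                     long publishTimeMillis,
                                     long eventTimeMillis) {
        // LinkedHashMap keeps insertion order, so columns stay in a fixed order.
        Map<String, Object> row = new LinkedHashMap<>(payloadFields);
        row.put("__key", key);
        row.put("__topic", topic);
        row.put("__messageId", messageId);
        row.put("__publishTime", Instant.ofEpochMilli(publishTimeMillis));
        row.put("__eventTime", Instant.ofEpochMilli(eventTimeMillis));
        return row;
    }
}
```

Keeping the metadata under double-underscore names avoids clashes with ordinary payload field names, which is presumably why the slide lists them with that prefix.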
  35. Pulsar-Flink (2) - Schema Examples
      Primitive Schema / Avro Schema
      https://github.com/streamnative/pulsar-flink
  36. Pulsar-Flink (3) - Pulsar Source
      https://github.com/streamnative/pulsar-flink
  37. Pulsar-Flink (4) - Streaming Tables
      https://github.com/streamnative/pulsar-flink
  38. Pulsar-Flink (5) - Pulsar Sink
      https://github.com/streamnative/pulsar-flink
  39. Pulsar-Flink (6) - Write to streaming tables
      https://github.com/streamnative/pulsar-flink
  40. Pulsar-Flink (7) - Pulsar Catalog
      https://github.com/streamnative/pulsar-flink
  41. Lambda: Batch Layer, Speed/Streaming Layer, Serving Layer
  42. Unified Data Stack: Unified Computing, Unified Data Storage, State-Centric
  43. Future Work
      ❏ New Source API
      ❏ FLIP-27
      ❏ Scale-out source parallelism
      ❏ Key_Shared & Sticky consumer
      ❏ End-to-end exactly-once
      ❏ Pulsar transaction in 2.5.0
      ❏ Pulsar / BookKeeper as a state backend
      ❏ Schema-aware Offload / Tiered Storage
  44. Key_Shared Subscription
  45. Key_Shared Subscription
      ❏ Key based ordering
      ❏ Key can be message key or a separated *order* key
      ❏ HashRing based routing
      ❏ Key based batcher
      ❏ Policies for messages without *keys*
      https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
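The hash-ring routing mentioned on slide 45 can be sketched with a generic consistent-hash ring. This is not Pulsar's implementation: `KeyHashRing` and `consumerFor` are hypothetical names, and CRC32 is used only as a convenient standard-library hash. The property it illustrates is the one Key_Shared relies on: every message with the same key is routed to the same consumer, so per-key ordering is preserved while different keys spread across consumers.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Toy consistent-hash ring: hypothetical names, not Pulsar's implementation.
final class KeyHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Place `pointsPerConsumer` points on the ring for each consumer name.
    KeyHashRing(List<String> consumers, int pointsPerConsumer) {
        for (String consumer : consumers) {
            for (int i = 0; i < pointsPerConsumer; i++) {
                ring.put(hash(consumer + "#" + i), consumer);
            }
        }
    }

    // Route a key to the first ring point at or after its hash, wrapping
    // around to the start of the ring when no such point exists.
    String consumerFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no consumers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // CRC32 stands in for whatever hash function a real broker would use.
    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Using several points per consumer smooths the key distribution, and when a consumer joins or leaves, only the keys that fall next to its ring points move to a different consumer.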
  46. Conclusion
      ❏ Apache Pulsar is a cloud-native streaming data storage system
      ❏ Two levels of reading API: Pub/Sub + Segment
      ❏ Structured Event Streams via Pulsar Schema
      ❏ Pulsar is the unified data storage for Flink
      ❏ Pulsar + Flink for a streaming-first, unified data processing stack
  47. Community
      ❏ Pulsar Website: https://pulsar.apache.org https://streamnative.io
      ❏ Twitter: @apache_pulsar / @streamnativeio
      ❏ Slack: https://apache-pulsar.herokuapp.com
      ❏ Mailing Lists: dev@pulsar.apache.org, users@pulsar.apache.org
      ❏ GitHub: https://github.com/apache/pulsar
      ❏ Medium: https://medium.com/streamnative
  48. Pulsar in Europe
      ❏ First Pulsar Meetup in Paris (@OVHCloud) on Friday 10/11
      ❏ https://www.meetup.com/Hadoop-User-Group-France/events/264920447/
      ❏ If you are looking for collaborations on Pulsar events, talk to us :-)
  49. Thanks!
