When Apache Pulsar meets Apache Flink
Sijie Guo (@sijieg)

Who am I
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Chair
❏ StreamNative Founder
❏ Ex-Twitter, Ex-Yahoo
❏ Interested in event streaming technologies

“Flexible Pub/Sub Messaging
Backed by durable log storage”

A brief history of Apache Pulsar
❏ 2012: Pulsar idea started
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 154 contributors, 900+ forks, 4000+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, ...

Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing
❏ ...

Data Processing with Apache Pulsar

Data Processing Categories
❏ Interactive
❏ Time critical
❏ Medium data size
❏ Rerun on failures
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
❏ Streaming
❏ Long running jobs
❏ Time critical
❏ Need scalability as well as resilience to failures
❏ Serverless
❏ Simple, light-weight processing
❏ Processing data with high velocity

Streaming-First
A Flink view on computing: batch processing is a special case of stream processing
A Pulsar view on data: infinite segmented streams (pub/sub + segment)
Flink view on computing + Pulsar view on data = streaming-first, unified data processing

Pulsar - A cloud-native architecture
Stateless Serving
Durable Storage

Pulsar - Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
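
To make the segment-centric layout concrete, here is a minimal in-memory sketch in plain Java. It is a toy model, not the BookKeeper or managed-ledger API: the names `ToySegment` and `ToyManagedLedger` and the `maxEntriesPerSegment` rollover rule are assumptions made for illustration, and replication to multiple bookies is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one segment (ledger): append-only, sealed once it is full.
class ToySegment {
    final long segmentId;
    final List<String> entries = new ArrayList<>();
    boolean sealed = false;

    ToySegment(long segmentId) {
        this.segmentId = segmentId;
    }
}

// Toy model of a topic partition (managed ledger): an ordered list of segments.
class ToyManagedLedger {
    private final int maxEntriesPerSegment; // hypothetical rollover threshold
    private final List<ToySegment> segments = new ArrayList<>();

    ToyManagedLedger(int maxEntriesPerSegment) {
        this.maxEntriesPerSegment = maxEntriesPerSegment;
        segments.add(new ToySegment(0));
    }

    // Append to the last segment; seal it and open a new one when it is full.
    // Returns a position of the form "segmentId:entryId".
    String append(String payload) {
        ToySegment current = segments.get(segments.size() - 1);
        if (current.entries.size() >= maxEntriesPerSegment) {
            current.sealed = true;
            current = new ToySegment(current.segmentId + 1);
            segments.add(current);
        }
        current.entries.add(payload);
        return current.segmentId + ":" + (current.entries.size() - 1);
    }

    int segmentCount() {
        return segments.size();
    }

    List<ToySegment> segments() {
        return segments;
    }
}
```

Only the last segment ever accepts writes in this model, which is what keeps each segment single-writer and append-only; older segments are sealed and never modified.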

Pulsar - Infinite stream storage

Pulsar - Stream as a unified view on data

Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
❏ Read data from brokers
❏ Consume / Seek / Receive
❏ Subscription Mode - Failover, Shared, Key_Shared
❏ Reprocessing data by rewinding (seeking) the cursors
❏ Segment (Batch)
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
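
The segment-level read path can be sketched as a planning step that prunes segments by publish time and hands each remaining segment to its own reader. This is a plain-Java illustration under stated assumptions, not Pulsar code: `SegmentMeta` and `SegmentPlanner.planSplits` are hypothetical names, and the sketch assumes each segment records the minimum and maximum publish timestamp of its entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-segment metadata: id plus the publish-time range it covers.
class SegmentMeta {
    final long segmentId;
    final long minPublishTime;
    final long maxPublishTime;

    SegmentMeta(long segmentId, long minPublishTime, long maxPublishTime) {
        this.segmentId = segmentId;
        this.minPublishTime = minPublishTime;
        this.maxPublishTime = maxPublishTime;
    }
}

class SegmentPlanner {
    // Keep only segments whose publish-time range overlaps [from, to].
    // Each returned segment id is one split that a separate reader can process.
    static List<Long> planSplits(List<SegmentMeta> segments, long from, long to) {
        List<Long> splits = new ArrayList<>();
        for (SegmentMeta s : segments) {
            boolean overlaps = s.maxPublishTime >= from && s.minPublishTime <= to;
            if (overlaps) {
                splits.add(s.segmentId);
            }
        }
        return splits;
    }
}
```

In this model the predicate is evaluated against segment metadata only, so segments outside the requested time range are never read from storage.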

Unified data processing on Pulsar

Flink Integration
❏ Available Connectors
❏ Streaming Source
❏ Streaming Sink
❏ Table Sink
❏ Flink 1.6.0
When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html

Flink 1.9 Integration
❏ Pulsar Schema Integration
❏ Table API as a first-class citizen
❏ Exactly-once source
❏ At-least-once sink

Pulsar Schema (1)
❏ Consensus of data at server-side
❏ Built-in schema registry
❏ Data schema on a per-topic basis
❏ Send and receive typed messages directly
❏ Validation
❏ Multi-version
❏ Schema evolution & compatibilities

Pulsar Schema (2)
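// Create a producer with a struct (Avro) schema and send a message.
// The topic and subscription names below are placeholders.
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
    .topic("user-topic")
    .create();
producer.newMessage()
    .value(User.builder()
        .userName("pulsar-user")
        .userId(1L)
        .build())
    .send();
// Create a consumer with the same struct schema and receive a typed message
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
    .topic("user-topic")
    .subscriptionName("user-subscription")
    .subscribe();
Message<User> message = consumer.receive();
User user = message.getValue();
consumer.acknowledge(message);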

Pulsar Schema (3) - SchemaInfo
{
  "type": "JSON",
  "schema": "{
    "type":"record",
    "name":"User",
    "namespace":"com.foo",
    "fields":[
      {
        "name":"file1",
        "type":["null","string"],
        "default":null
      },
      {
        "name":"file2",
        "type":"string",
        "default":null
      },
      {
        "name":"file3",
        "type":["null","string"],
        "default":"dfdf"
      }
    ]
  }",
  "properties": {}
}

Pulsar Schema (6) - Compatibility Strategy

Pulsar Schema (7) - Multi versions

Pulsar-Flink (1) - Schema <-> Row
https://github.com/streamnative/pulsar-flink
❏ Topics without schema or with primitive schemas
❏ `value` field for message payload
❏ Topics with struct schemas (AVRO, JSON)
❏ Field names and types are kept in the row
❏ Metadata Fields
❏ __key: Binary
❏ __topic: String
❏ __messageId: Binary
❏ __publishTime: Timestamp
❏ __eventTime: Timestamp
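
The mapping above can be sketched in plain Java with a row modeled as an ordered map. This is an illustration only and not the pulsar-flink connector code: `ToyMetadata` and `ToyRowMapper` are hypothetical names, and `java.time.Instant` stands in for the Timestamp type.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the per-message metadata fields listed above.
class ToyMetadata {
    final byte[] key;
    final String topic;
    final byte[] messageId;
    final Instant publishTime;
    final Instant eventTime;

    ToyMetadata(byte[] key, String topic, byte[] messageId,
                Instant publishTime, Instant eventTime) {
        this.key = key;
        this.topic = topic;
        this.messageId = messageId;
        this.publishTime = publishTime;
        this.eventTime = eventTime;
    }
}

class ToyRowMapper {
    // No schema or primitive schema: the payload goes into a single `value` field.
    static Map<String, Object> primitiveRow(Object payload, ToyMetadata meta) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("value", payload);
        addMetadata(row, meta);
        return row;
    }

    // Struct schema (AVRO, JSON): field names and values are kept in the row.
    static Map<String, Object> structRow(Map<String, Object> fields, ToyMetadata meta) {
        Map<String, Object> row = new LinkedHashMap<>(fields);
        addMetadata(row, meta);
        return row;
    }

    // Append the metadata columns after the payload columns.
    private static void addMetadata(Map<String, Object> row, ToyMetadata meta) {
        row.put("__key", meta.key);
        row.put("__topic", meta.topic);
        row.put("__messageId", meta.messageId);
        row.put("__publishTime", meta.publishTime);
        row.put("__eventTime", meta.eventTime);
    }
}
```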

Pulsar-Flink (2) - Schema Examples
Examples shown: Primitive Schema, Avro Schema

Pulsar-Flink (3) - Pulsar Source

Pulsar-Flink (4) - Streaming Tables

Pulsar-Flink (5) - Topic Partitions Discovery
❏ Find matching topics
❏ Fetch schemas for each topic
❏ Build schema-specific deserializer
❏ Each reader is responsible for one topic partition
❏ Each source task has a partition discovery task to check for newly added partitions
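
The discovery loop described above can be approximated by a small plain-Java sketch that matches partition names against a pattern and reports only the ones not seen before. It is a toy under stated assumptions: `ToyPartitionDiscoverer` is a hypothetical name, the list of partitions is passed in rather than fetched from a cluster, and schema fetching and deserializer construction are omitted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class ToyPartitionDiscoverer {
    private final Pattern topicPattern;
    private final Set<String> known = new TreeSet<>();

    ToyPartitionDiscoverer(String topicRegex) {
        this.topicPattern = Pattern.compile(topicRegex);
    }

    // Return the matching partitions that were not seen in earlier calls
    // and remember them, so each partition is handed to a reader only once.
    List<String> discover(Collection<String> allPartitions) {
        List<String> added = new ArrayList<>();
        for (String partition : allPartitions) {
            if (topicPattern.matcher(partition).matches() && known.add(partition)) {
                added.add(partition);
            }
        }
        return added;
    }
}
```

Calling `discover` periodically with the current partition list yields only the newly added partitions, and each of those would get its own reader.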

Pulsar-Flink (6) - Exactly-once Source
❏ Message order on partition basis
❏ Seek & read
❏ Checkpoints with MessageID
❏ Durable cursor to keep un-checkpointed messages alive
❏ Move cursor when a checkpoint is completed
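
The interplay between checkpoints and the durable cursor can be simulated in a few lines of plain Java. This is a simplified model and not the connector implementation: `ToyExactlyOnceReader` is a hypothetical name, a list index stands in for a MessageID, and the partition is an in-memory list.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyExactlyOnceReader {
    private final List<String> partition;   // ordered messages of one partition
    private int nextToRead = 0;             // read position (stands in for a MessageID)
    private int durableCursor = 0;          // messages from this index on must be retained
    private final Map<Long, Integer> pendingCheckpoints = new HashMap<>();

    ToyExactlyOnceReader(List<String> partition) {
        this.partition = partition;
    }

    // Read the next message in partition order, or null if none is left.
    String read() {
        return nextToRead < partition.size() ? partition.get(nextToRead++) : null;
    }

    // Checkpoint: record the current read position as part of the snapshot.
    int snapshot(long checkpointId) {
        pendingCheckpoints.put(checkpointId, nextToRead);
        return nextToRead;
    }

    // The checkpoint is complete: only now is it safe to move the durable cursor.
    void notifyCheckpointComplete(long checkpointId) {
        Integer position = pendingCheckpoints.remove(checkpointId);
        if (position != null && position > durableCursor) {
            durableCursor = position;
        }
    }

    // Recovery: seek back to the position stored in the restored checkpoint.
    void restore(int checkpointedPosition) {
        if (checkpointedPosition < durableCursor) {
            throw new IllegalStateException("messages before the durable cursor may be gone");
        }
        nextToRead = checkpointedPosition;
    }

    int durableCursor() {
        return durableCursor;
    }
}
```

Because the cursor only advances after a checkpoint completes, every message read since the last completed checkpoint is still available for replay after a failure.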

Pulsar-Flink (7) - Pulsar Sink

Pulsar-Flink (8) - Write to streaming tables

Future directions
❏ Unified Source API for both batch and streaming execution
❏ FLIP-27
❏ Pulsar as a catalog
❏ Pulsar as a state backend
❏ Scale-out source parallelism
❏ Key_Shared & Sticky consumer
❏ End-to-end exactly-once
❏ Pulsar transaction in 2.5.0

Key_Shared Subscription
❏ Key based ordering
❏ Key can be the message key or a separate *order* key
❏ HashRing based routing
❏ Key based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
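
Hash-ring based routing can be illustrated with a small consistent-hashing sketch in plain Java. It is not the broker's implementation: `ToyKeySharedRouter` and `pointsPerConsumer` are hypothetical names, and CRC32 is used only because it ships with the JDK, not because Pulsar uses it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class ToyKeySharedRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerConsumer; // hypothetical number of ring points per consumer

    ToyKeySharedRouter(int pointsPerConsumer) {
        this.pointsPerConsumer = pointsPerConsumer;
    }

    // Stand-in hash function (an assumption of this sketch).
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Place several points on the ring for the consumer.
    void addConsumer(String name) {
        for (int i = 0; i < pointsPerConsumer; i++) {
            ring.put(hash(name + "#" + i), name);
        }
    }

    // Remove every ring point owned by the consumer.
    void removeConsumer(String name) {
        ring.values().removeIf(name::equals);
    }

    // Route a key to the first ring point at or after its hash, wrapping around.
    String route(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no consumers");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

As long as the set of consumers is unchanged, every message with the same key reaches the same consumer, which is what gives key-based ordering; when a consumer leaves, only the keys that mapped to its ring points move to other consumers.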

Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Apache Flink provides a unified view of computing
❏ Pulsar + Flink for streaming-first, unified data processing

Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative
