Modern Data Processing - Big Data Analytical Streaming Data Pipelines

David Kjerrumgaard
Developer Advocate
● Apache Pulsar Committer | Author of Pulsar
In Action
● Former Principal Software Engineer on
Splunk’s messaging team responsible for
Splunk’s internal Pulsar-as-a-Service
platform
● Former Director of Solution Architecture at
Streamlio

Tim Spann
Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar,
Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps to grow the Pulsar community sharing rich technical
knowledge and experience at both global conferences and through
individual conversations.

Hosted by
Save Your Spot Now
Use code MODERNDATA20
to get 20% off.
Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
5 Keynotes
12 Breakout Sessions
1 Amazing Happy Hour

Pulsar Summit
San Francisco
Sponsorship
Prospectus
Community Sponsorships Available
Help engage and connect the Apache Pulsar
community by becoming an official sponsor for
Pulsar Summit San Francisco 2022! Learn more
about the requirements and benefits of
becoming a community sponsor.

FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft

streamnative.io
Agenda
• Streaming Data Pipelines The Easy Way.
• Code / Demonstration.
https://tinyurl.com/bddpwjuf

Our Pipeline Example

Apache Pulsar is a Cloud-Native
Messaging and Event-Streaming Platform.

101
● Unified Messaging Platform
● Guaranteed Message Delivery
● Resiliency
● Infinite Scalability

What are the Benefits of Pulsar?
● Data Durability
● Scalability
● Geo-Replication
● Multi-Tenancy
● Unified Messaging Model

The right API for async
● Designed for teams, with built-in multi-tenancy
● Power and flexibility, with support for simultaneous streaming and messaging use cases
● Ideal for high-scale, mission-critical microservices
● Easy to use, with a simple pub/sub API

The Right Architecture for Unified Data
Pulsar’s Architecture
Architectural Advantages: decoupled compute and storage
● Separate compute and storage has become standard practice in cloud-native architectures
● Supports easy scale up/down.
[Diagram: Brokers (stateless compute), Bookies (message & subscription state), Pluggable Metadata Store]

Key Pulsar Concepts: Architecture
● “Brokers”
○ Handle message routing and connections
○ Stateless, but with caches
○ Automatic load-balancing
○ Topics are composed of multiple segments
● “Bookies” (store messages)
○ Store messages and cursors
○ Messages are grouped in segments/ledgers
○ A group of bookies forms an “ensemble” to store a ledger
● Metadata Storage (metadata & service discovery)
○ Stores metadata for both Pulsar and BookKeeper
○ Service discovery
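
The way an ensemble of bookies shares the entries of one ledger can be pictured with a short sketch. This is a simplified model of round-robin striping, not BookKeeper client code: the function name `write_set`, the parameter `write_quorum` and the bookie names are hypothetical, chosen only for illustration.

```python
from typing import List


def write_set(entry_id: int, ensemble: List[str], write_quorum: int) -> List[str]:
    """Return the bookies that would store one entry under round-robin striping.

    Simplified model: entry N starts at bookie N mod ensemble size and is
    copied to the next write_quorum - 1 bookies, wrapping around the ensemble.
    """
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


if __name__ == "__main__":
    # Hypothetical three-bookie ensemble; every entry is written to two bookies.
    ensemble = ["bookie-1", "bookie-2", "bookie-3"]
    for entry_id in range(4):
        print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
```

Because consecutive entries start on different bookies, reads and writes for a single ledger are spread across the whole ensemble instead of landing on one node.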

Unified Messaging Model
[Diagram: a Pulsar topic/partition feeds four subscriptions, each with its own consumers: Exclusive (a second consumer is marked with an X), Failover (a standby consumer takes over in case of failure in Consumer B-0), Shared, and Key-Shared. The figure is labeled "Streaming" and "Messaging" to show both styles served from the same topic.]

Pulsar’s Publish-Subscribe model
[Diagram: Producer 1 and Producer 2 publish to a topic on a broker; a subscription delivers the messages to Consumer 1, Consumer 2 and Consumer 3.]
● Producers send messages.
● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
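
The relationship between a topic and its subscriptions can be sketched with a small in-memory model. This is a toy stand-in, not the Pulsar client API: the class `ToyTopic` and its methods are hypothetical, and the choice to start a new subscription at the end of the log is an assumption of this sketch.

```python
from typing import Dict, List, Optional


class ToyTopic:
    """In-memory stand-in for a topic: an ordered log plus one cursor per subscription."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[bytes] = []
        self.cursors: Dict[str, int] = {}

    def publish(self, payload: bytes) -> int:
        """Append a message and return its position in the log."""
        self.log.append(payload)
        return len(self.log) - 1

    def subscribe(self, subscription: str) -> None:
        """Create a named subscription; in this toy it starts at the end of the log."""
        self.cursors.setdefault(subscription, len(self.log))

    def receive(self, subscription: str) -> Optional[bytes]:
        """Return the next unread message for the subscription, or None if caught up."""
        position = self.cursors[subscription]
        if position >= len(self.log):
            return None
        self.cursors[subscription] = position + 1
        return self.log[position]


if __name__ == "__main__":
    topic = ToyTopic("demo-topic")  # hypothetical topic name
    topic.subscribe("analytics")
    topic.subscribe("alerts")
    topic.publish(b"reading-1")
    topic.publish(b"reading-2")
    print(topic.receive("analytics"), topic.receive("analytics"))
    print(topic.receive("alerts"))
```

Each subscription keeps its own cursor, so the two subscriptions above read the same messages independently and at their own pace.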

Topics
[Diagram: a Pulsar cluster organized as tenants, then namespaces, then topics.]
● Tenants: Compliance, Data Services, Marketing
● Namespaces: Microservices, ETL, Campaigns, ETL, Risk Assessment
● Topics: Topic-1 (Cust Auth), Topic-1 (Location Resolution), Topic-2 (Demographics), Topic-1 (Budgeted Spend), Topic-1 (Acct History), Topic-1 (Risk Detection)
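
The tenant / namespace / topic hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`); a short name resolves to the `public` tenant and `default` namespace. The parser below is a minimal sketch of that naming scheme. The helper names `TopicName` and `parse_topic` and the example topic are hypothetical, not part of any Pulsar library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a topic name into domain, tenant, namespace and local topic name."""
    if "://" not in name:
        # Short names fall back to the default tenant and namespace.
        return TopicName("persistent", "public", "default", name)
    domain, rest = name.split("://", 1)
    if domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unknown topic domain: {domain!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic, got: {rest!r}")
    return TopicName(domain, *parts)


if __name__ == "__main__":
    # Hypothetical topic name, for illustration only.
    print(parse_topic("persistent://marketing/campaigns/budgeted-spend"))
    print(parse_topic("my-topic"))
```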

Pulsar subscription modes
Different subscription modes have different semantics:
● Exclusive/Failover: guaranteed order, single active consumer
● Shared: multiple active consumers, no ordering guarantee
● Key_Shared: multiple active consumers, order preserved for a given key
[Diagram: Producer 1 and Producer 2 publish to a Pulsar topic with four subscriptions.]
● Subscription A (Exclusive): Consumer A
● Subscription B (Failover): Consumer B-1 and Consumer B-2; B-2 takes over in case of failure in Consumer B-1
● Subscription C (Shared): Consumer C-1 and Consumer C-2; keys are mixed across consumers, with <K1,V10>, <K2,V21>, <K1,V12> going to one and <K2,V20>, <K1,V11>, <K2,V22> to the other
● Subscription D (Key-Shared): Consumer D-1 and Consumer D-2; <K1,V10>, <K1,V11>, <K1,V12> go to one consumer and <K2,V20>, <K2,V21>, <K2,V22> to the other
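
The difference between the modes can be made concrete with a toy dispatcher. This is a sketch under stated assumptions, not how the broker is implemented: `dispatch` is a hypothetical function, Exclusive and Failover are both modeled as "the first consumer in the list is the active one", Shared is modeled as plain round-robin, and Key_Shared uses a CRC32 hash of the key as a stand-in for the broker's own key-hashing scheme.

```python
from typing import Dict, List, Tuple
from zlib import crc32

Message = Tuple[str, int]  # (key, value), e.g. ("K1", 10)


def dispatch(messages: List[Message], consumers: List[str], mode: str) -> Dict[str, List[Message]]:
    """Toy model: decide which consumer of one subscription receives each message."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered: Dict[str, List[Message]] = {c: [] for c in consumers}
    if mode in ("exclusive", "failover"):
        # A single active consumer receives everything, so order is preserved.
        delivered[consumers[0]].extend(messages)
    elif mode == "shared":
        # Round-robin across consumers: more throughput, no ordering guarantee.
        for i, message in enumerate(messages):
            delivered[consumers[i % len(consumers)]].append(message)
    elif mode == "key_shared":
        # All messages with the same key go to the same consumer (CRC32 is an assumption).
        for message in messages:
            owner = crc32(message[0].encode("utf-8")) % len(consumers)
            delivered[consumers[owner]].append(message)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return delivered


if __name__ == "__main__":
    msgs = [("K1", 10), ("K2", 20), ("K1", 11), ("K2", 21), ("K1", 12), ("K2", 22)]
    for mode in ("exclusive", "shared", "key_shared"):
        print(mode, dispatch(msgs, ["consumer-1", "consumer-2"], mode))
```

In the Key_Shared case every message for a given key lands on one consumer in publish order, which is the per-key ordering guarantee described above.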

Pulsar Terminology
● Producer: a process that publishes messages to a topic.
● Consumer: a process that establishes a subscription to a topic and processes messages published to that topic.
● Subscription: a named configuration rule that determines how messages are delivered to consumers.
● Broker: handles the connections and routes messages.
● Instance: a group of clusters that act together as a single unit.
● Cluster: a set of Pulsar brokers, a ZooKeeper quorum, and an ensemble of BookKeeper bookies.
● Tenant: the administrative unit for allocating capacity and enforcing an authentication/authorization scheme.
● Namespace: a grouping mechanism for related topics.
● Topic: a named channel for transmitting messages from producers to consumers.
● Message: belongs to a topic and contains an arbitrary payload.
● BookKeeper: the log storage system that Pulsar uses for durable storage of all messages.
● Bookie: stores messages and cursors. Messages are grouped in segments/ledgers.
● ZooKeeper: stores metadata for both Pulsar and BookKeeper and also performs service discovery.

Demo
pyspark --packages io.delta:delta-core_2.12:1.2.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

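The same Delta Lake settings can also be applied from a Python script instead of the `pyspark` shell. The sketch below assumes PySpark is installed and reuses the package coordinates and configuration values from the command above; `DELTA_CONF` and `build_session` are hypothetical names, and `spark.jars.packages` is the configuration-key counterpart of the `--packages` flag.

```python
from typing import Dict

# Settings copied from the pyspark command line in the demo.
DELTA_CONF: Dict[str, str] = {
    "spark.jars.packages": "io.delta:delta-core_2.12:1.2.1",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_session(app_name: str = "delta-demo"):
    """Create a SparkSession with the Delta Lake settings applied.

    PySpark is imported lazily so this module can be loaded without it.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from `build_session()`, a quick smoke test is to write a small table with `spark.range(5).write.format("delta").save(path)`, where `path` is any writable location of your choosing.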
IoT Data
IoT Ingestion: High-volume streaming sources, sensors, multiple message formats, diverse protocols and multi-vendor devices create data ingestion challenges.
Other Sources: Transit data, news, Twitter, status feeds, REST data, stock data and more.

Modern Streaming Lakehouse Pipeline
[Diagram components:]
● Streaming edge protocols: Pulsar, Kafka, MQTT, WebSocket, AMQP
● StreamNative Cloud
○ Unified batch and stream computing (batch + stream)
○ Unified batch and stream storage (queuing + streaming), with offload to tiered storage
● StreamNative Hub: Pulsar sinks
● Microservice

[Webinar]
Building Microservices
[Blog post]
Event-Driven Microservices

Now Available
On-Demand Pulsar
Training
Academy.StreamNative.io
