Modern Data Processing: Big Data Analytical Streaming Data Pipelines
https://github.com/tspannhw/FLiP-Pi-DeltaLake-Thermal/blob/main/README.md
https://www.meetup.com/sf-big-analytics/events/286983210/
Details
Please register on the event website to receive your customized zoom joining link: https://www.aicamp.ai/event/eventdetails/W2022072612
(Our partner AICamp provides free Zoom service for our members)
Agenda:
12:00 - 12:05 pm members join online
12:05 - 1 pm talk + Q&A
1 pm – closing
Summary: In this meetup talk, we will show some best practices we have discovered over the last 7 years of building data streaming applications, including IoT, CDC, logs, and data feeds.
In our modern data processing approach, we combine several highly scalable open-source frameworks to get the best features of each. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar.
From there we build streaming ETL with Apache Spark, and enhance events with Pulsar Functions for ML and enrichment.
We build continuous queries against our topics with Flink SQL for aggregations, real-time alerts, and Delta Lake population.
With Slides, Demos, Q&A
Speakers: Timothy Spann and David Kjerrumgaard
Timothy Spann
Developer Advocate, StreamNative
Former Principal DataFlow Field Engineer at Cloudera
Former Senior Solutions Engineer at Hortonworks
Former Senior Field Engineer at Pivotal
DZone MVB Blogger
David Kjerrumgaard
Developer Advocate
Apache Pulsar Committer | Author of Pulsar In Action
Former Principal Software Engineer on Splunk’s messaging team, responsible for Splunk’s internal Pulsar-as-a-Service platform
Former Director of Solution Architecture at Streamlio
2. David Kjerrumgaard
Developer Advocate
● Apache Pulsar Committer | Author of Pulsar In Action
● Former Principal Software Engineer on Splunk’s messaging team, responsible for Splunk’s internal Pulsar-as-a-Service platform
● Former Director of Solution Architecture at Streamlio
3. Tim Spann
Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps to grow the Pulsar community, sharing rich technical knowledge and experience at both global conferences and through individual conversations.
4. Pulsar Summit San Francisco
Hotel Nikko, August 18, 2022
5 Keynotes, 12 Breakout Sessions, 1 Amazing Happy Hour
Save your spot now: use code MODERNDATA20 to get 20% off.
5. Pulsar Summit San Francisco: Sponsorship Prospectus
Community Sponsorships Available
Help engage and connect the Apache Pulsar community by becoming an official sponsor for Pulsar Summit San Francisco 2022! Learn more about the requirements and benefits of becoming a community sponsor.
6. FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
11. What are the Benefits of Pulsar?
● Data Durability
● Scalability
● Geo-Replication
● Multi-Tenancy
● Unified Messaging Model
12. The right API for async
● Designed for teams, with built-in multi-tenancy
● Power and flexibility, with support for simultaneous streaming and messaging use cases
● Ideal for high-scale, mission-critical microservices
● Easy to use, with a simple pub/sub API
13. The Right Architecture for Unified Data
Pulsar’s Architecture: decoupled compute and storage
● Separating compute and storage has become standard practice in cloud-native architectures.
● Supports easy scale up/down.
Architectural Advantages
● Brokers: stateless compute
● Bookies: message & subscription state
● Pluggable metadata store
14. Key Pulsar Concepts: Architecture
● “Brokers”
○ Handle message routing and connections
○ Stateless, but with caches
○ Automatic load-balancing
○ Topics are composed of multiple segments
● “Bookies”
○ Store messages and cursors
○ Messages are grouped in segments/ledgers
○ A group of bookies forms an “ensemble” to store a ledger
● Metadata storage
○ Stores metadata for both Pulsar and BookKeeper
○ Service discovery
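The idea that a group of bookies forms an ensemble to store a ledger can be illustrated with a small placement sketch. It assumes round-robin striping, where each entry is written to a fixed number of consecutive bookies in the ensemble. The `write_quorum` parameter and the bookie names are assumptions for illustration and do not come from the slides.

```python
def bookies_for_entry(entry_id, ensemble, write_quorum):
    """Return the bookies that would store one ledger entry (toy round-robin model)."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    # Each entry goes to `write_quorum` consecutive bookies, wrapping around the ensemble.
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


ensemble = ["bookie-1", "bookie-2", "bookie-3"]  # hypothetical bookie names
placement = {entry: bookies_for_entry(entry, ensemble, 2) for entry in range(4)}

for entry, bookies in placement.items():
    print(entry, bookies)
```

With three bookies and two copies per entry, every bookie holds roughly two thirds of the ledger, so reads and writes spread across the ensemble.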
16. Pulsar’s Publish-Subscribe model
[Diagram: Producers 1 and 2 publish to a topic on a broker; a subscription delivers the messages to Consumers 1, 2 and 3.]
● Producers send messages.
● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
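The relationship between topics and subscriptions can be sketched as an ordered log with one read position per subscription. This is a toy in-memory model, not the Pulsar client API: the `Topic` class, its method names, the subscription names, and the topic name are all hypothetical.

```python
class Topic:
    """Toy in-memory topic: an ordered log plus one cursor per subscription."""

    def __init__(self, name):
        self.name = name
        self.log = []        # messages in publish order
        self.cursors = {}    # subscription name -> index of next message to deliver

    def publish(self, payload):
        self.log.append(payload)

    def subscribe(self, subscription):
        # Simplification: a new subscription starts at the current end of the log.
        self.cursors.setdefault(subscription, len(self.log))

    def receive(self, subscription):
        """Return the next message for a subscription, or None if it is caught up."""
        position = self.cursors[subscription]
        if position >= len(self.log):
            return None
        # In this sketch, delivery doubles as acknowledgment.
        self.cursors[subscription] = position + 1
        return self.log[position]


topic = Topic("persistent://public/default/thermal")  # hypothetical topic name
topic.subscribe("alerts")
topic.subscribe("archive")
for reading in (b"temp=21.5", b"temp=22.0"):
    topic.publish(reading)

alerts = [topic.receive("alerts"), topic.receive("alerts")]
archive_first = topic.receive("archive")
```

Each subscription sees every message at its own pace, which is why the same topic can feed unrelated applications.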
18. Pulsar subscription modes
Different subscription modes have different semantics:
● Exclusive/Failover: guaranteed order, single active consumer
● Shared: multiple active consumers, no order
● Key_Shared: multiple active consumers, order for a given key
[Diagram: Producers 1 and 2 publish to a Pulsar topic with four subscriptions.
Subscription A (Exclusive): Consumer A.
Subscription B (Failover): Consumers B-1 and B-2; B-2 takes over in case of failure in Consumer B-1.
Subscription C (Shared): Consumers C-1 and C-2; one receives <K1,V10>, <K2,V21>, <K1,V12> and the other receives <K2,V20>, <K1,V11>, <K2,V22>.
Subscription D (Key_Shared): Consumers D-1 and D-2; one receives <K1,V10>, <K1,V11>, <K1,V12> and the other receives <K2,V20>, <K2,V21>, <K2,V22>.]
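The difference between the subscription modes can be sketched with a toy dispatcher. The function names are hypothetical, and the routing rules (plain round-robin for Shared, a CRC32 hash modulo the consumer count for Key_Shared) are simplifying assumptions rather than Pulsar's actual dispatch logic.

```python
import zlib
from itertools import cycle


def dispatch_shared(messages, consumers):
    """Shared: spread messages round-robin, ignoring keys."""
    assignment = {c: [] for c in consumers}
    for consumer, message in zip(cycle(consumers), messages):
        assignment[consumer].append(message)
    return assignment


def dispatch_key_shared(messages, consumers):
    """Key_Shared: every message with the same key goes to the same consumer."""
    assignment = {c: [] for c in consumers}
    for key, value in messages:
        # Assumed routing rule; two keys may hash to the same consumer.
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)
        assignment[consumers[index]].append((key, value))
    return assignment


def dispatch_failover(messages, consumers, failed=()):
    """Failover: one active consumer; the next one takes over if it fails."""
    active = next(c for c in consumers if c not in failed)
    return {active: list(messages)}


messages = [("K1", 10), ("K2", 20), ("K2", 21), ("K1", 11), ("K1", 12), ("K2", 22)]

shared = dispatch_shared(messages, ["C-1", "C-2"])
key_shared = dispatch_key_shared(messages, ["D-1", "D-2"])
failover = dispatch_failover(messages, ["B-1", "B-2"], failed={"B-1"})
```

In the Shared result each consumer receives a mix of K1 and K2 messages, while in the Key_Shared result all messages for a key stay together and in publish order.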
19. Pulsar Terminology
Producer: a process that publishes messages to a topic.
Consumer: a process that establishes a subscription to a topic and processes messages published to that topic.
Subscription: a named configuration rule that determines how messages are delivered to consumers.
Broker: handles connections and routes messages.
Instance: a group of clusters that act together as a single unit.
Cluster: a set of Pulsar brokers, a ZooKeeper quorum, and an ensemble of BookKeeper bookies.
Tenant: the administrative unit for allocating capacity and enforcing an authentication/authorization scheme.
Namespace: a grouping mechanism for related topics.
Topic: a named channel for transmitting messages from producers to consumers.
Message: belongs to a topic and contains an arbitrary payload.
BookKeeper: the log storage system that Pulsar uses for durable storage of all messages.
Bookie: stores messages and cursors. Messages are grouped in segments/ledgers.
ZooKeeper: stores metadata for both Pulsar and BookKeeper, and also performs service discovery.
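Tenants, namespaces and topics come together in Pulsar's fully qualified topic names, which take the form `persistent://tenant/namespace/topic`. The parser below is a minimal sketch of that naming scheme; the `TopicName` type, the `parse_topic_name` helper, and the `thermal` topic are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"expected persistent:// or non-persistent:// prefix: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic: {name!r}")
    return TopicName(domain, *parts)


# "public" and "default" are the tenant and namespace of a fresh Pulsar install;
# "thermal" is a made-up topic name.
parsed = parse_topic_name("persistent://public/default/thermal")
print(parsed.tenant, parsed.namespace, parsed.topic)
```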
27. IoT Data
IoT Ingestion: High-volume streaming sources, sensors, multiple message formats, diverse protocols and multi-vendor devices create data ingestion challenges.
Other Sources: Transit data, news, Twitter, status feeds, REST data, stock data and more.