TRACK: MODERN INFRASTRUCTURE
NOVEMBER 10, 2022
Timothy Spann, StreamNative
FLiP Stack for Cloud Data Lakes
Tim Spann, Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink,
Spark, NiFi, Big Data, Cloud, MXNet, IoT and more.
○ Today, he helps grow the Pulsar community by sharing rich technical knowledge
and experience at global conferences and in individual conversations.
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar,
Apache NiFi, Apache Spark and open
source friends.
https://bit.ly/32dAJft
Apache Pulsar
Serverless computing framework.
Unbounded storage, multi-tiered
architecture, and tiered-storage.
Streaming & Pub/Sub messaging
semantics.
Multi-protocol support.
Open Source
Cloud-Native
Why Apache Pulsar?
● Unified Messaging Platform
● Guaranteed Message Delivery
● Resiliency
● Infinite Scalability
Pulsar Benefits
Unified Messaging Model
Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the same cluster, either via access control or in entirely different namespaces.
Scalability
Decoupled compute and storage enable horizontal scaling to handle data scale and management complexity.
Geo-replication
Support for multi-datacenter replication, both asynchronous and synchronous, for built-in disaster recovery.
Tiered storage
Offload historical data to cloud-native storage and store event streams for indefinite periods of time.
Pulsar: Unified Messaging + Data Streaming
Messaging
Ideal for work queues that do not require tasks to be performed in a particular order, for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems.
... and Streaming
Works best in situations where the order of messages is important, for example, data ingestion. Kafka and Amazon Kinesis are examples of messaging systems that use streaming semantics for consuming messages.
Pulsar Cluster
[Figure: a Pulsar cluster organized as tenants, each containing namespaces, each containing topics.
Tenants: Compliance, Data Services, Marketing.
Namespaces: Microservices, ETL, Campaigns, ETL, Risk Assessment.
Topics: Topic-1 (Cust Auth), Topic-1 (Location Resolution), Topic-2 (Demographics), Topic-1 (Budgeted Spend), Topic-1 (Acct History), Topic-1 (Risk Detection).]
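The tenant / namespace / topic hierarchy above is also visible in fully qualified topic names such as persistent://conf/ete/first, which appears in the admin commands later in this deck. The following is a minimal Python sketch of splitting such a name into its parts. The names TopicName, parse_topic and format_topic are hypothetical helpers written for this illustration, not part of any Pulsar client.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """Parts of a fully qualified Pulsar topic name (hypothetical helper type)."""
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


def format_topic(t: TopicName) -> str:
    """Rebuild the fully qualified name from its parts."""
    return f"{t.domain}://{t.tenant}/{t.namespace}/{t.topic}"


if __name__ == "__main__":
    t = parse_topic("persistent://conf/ete/first")
    print(t.tenant, t.namespace, t.topic)
```

Here conf is the tenant, ete the namespace and first the topic, which matches the order in which the cleanup commands in this deck delete them.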
Pulsar’s Publish-Subscribe model
[Figure: Producer 1 and Producer 2 publish to a Topic on the Broker; a Subscription delivers messages to Consumer 1, Consumer 2 and Consumer 3.]
● Producers send messages.
● A topic is an ordered, named channel that
producers use to transmit messages to
subscribed consumers.
● Messages belong to a topic and contain an
arbitrary payload.
● Brokers handle connections and route
messages between producers and consumers.
● Subscriptions are named configuration rules
that determine how messages are delivered
to consumers.
● Consumers receive messages.
Pulsar Subscription Modes
Different subscription modes
have different semantics:
Exclusive/Failover - guaranteed
order, single active consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
[Figure: Producer 1 and Producer 2 publish to a Pulsar topic with four subscriptions.
Subscription A (Exclusive): Consumer A.
Subscription B (Failover): Consumer B-1, with Consumer B-2 taking over in case of failure in Consumer B-1.
Subscription C (Shared): Consumer C-1 and Consumer C-2; one receives <K1,V10>, <K2,V21>, <K1,V12> and the other <K2,V20>, <K1,V11>, <K2,V22>.
Subscription D (Key_Shared): Consumer D-1 and Consumer D-2; one receives <K1,V10>, <K1,V11>, <K1,V12> and the other <K2,V20>, <K2,V21>, <K2,V22>.]
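The ordering differences between the subscription modes can be illustrated with a small simulation. This is a toy Python sketch, not Pulsar's dispatcher. It assumes round-robin delivery for Shared and a CRC32 hash of the key modulo the consumer count for Key_Shared, both chosen only for illustration. The function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import cycle

# Messages as (key, value) pairs in publish order, mirroring the <K, V> notation.
MESSAGES = [("K1", "V10"), ("K1", "V11"), ("K2", "V20"),
            ("K1", "V12"), ("K2", "V21"), ("K2", "V22")]


def dispatch_exclusive(messages, consumers):
    """Only the first (active) consumer receives messages, in publish order."""
    return {consumers[0]: list(messages)}


def dispatch_shared(messages, consumers):
    """Round-robin across consumers: load is spread, per-key order is not kept together."""
    out = defaultdict(list)
    for consumer, msg in zip(cycle(consumers), messages):
        out[consumer].append(msg)
    return dict(out)


def dispatch_key_shared(messages, consumers):
    """A stable hash of the key picks the consumer, so each key stays on one consumer.

    Assumption: CRC32 modulo consumer count stands in for the broker's real key hashing.
    """
    out = defaultdict(list)
    for key, value in messages:
        idx = zlib.crc32(key.encode()) % len(consumers)
        out[consumers[idx]].append((key, value))
    return dict(out)


if __name__ == "__main__":
    print("Exclusive :", dispatch_exclusive(MESSAGES, ["A"]))
    print("Shared    :", dispatch_shared(MESSAGES, ["C-1", "C-2"]))
    print("Key_Shared:", dispatch_key_shared(MESSAGES, ["D-1", "D-2"]))
```

In the Shared output each consumer sees a mix of keys, while in the Key_Shared output every message for a given key lands on a single consumer in publish order.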
Producer-Consumer
● Producer: the publisher sends data and does not know about the subscribers or their status.
● Topic: all interactions go through Pulsar, which handles all communication.
● Consumer: the subscriber receives data from the publisher and never interacts with it directly.
Schema Registry
[Figure: the Schema Registry stores schema-1, schema-2 and schema-3 (value=Avro/Protobuf/JSON). Producers and consumers each keep a local cache of schemas, and messages carry a schema ID plus data.
Producers: send (register) the schema if it is not in the local cache, then send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID.
Consumers: get the schema by ID if it is not in the local cache, then read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID.]
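The register-once, look-up-once flow with local caches can be sketched as a toy in-memory model. This Python example is an illustration under stated assumptions, not the Pulsar client API. The classes SchemaRegistry, Producer and Consumer are hypothetical, a schema is represented by a plain string, and JSON stands in for the serialization format.

```python
import json


class SchemaRegistry:
    """Toy in-memory registry: assigns an integer ID to each distinct schema."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.calls = 0  # counts round trips, to show the effect of caching

    def register(self, schema: str) -> int:
        self.calls += 1
        if schema not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[schema] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[schema]

    def get(self, schema_id: int) -> str:
        self.calls += 1
        return self._by_id[schema_id]


class Producer:
    def __init__(self, registry: SchemaRegistry):
        self.registry, self.cache = registry, {}

    def send(self, schema: str, record: dict) -> tuple:
        if schema not in self.cache:      # register only on a local cache miss
            self.cache[schema] = self.registry.register(schema)
        # The message carries the schema ID plus the serialized data.
        return self.cache[schema], json.dumps(record).encode()


class Consumer:
    def __init__(self, registry: SchemaRegistry):
        self.registry, self.cache = registry, {}

    def read(self, message: tuple) -> tuple:
        schema_id, payload = message
        if schema_id not in self.cache:   # fetch by ID only on a local cache miss
            self.cache[schema_id] = self.registry.get(schema_id)
        return self.cache[schema_id], json.loads(payload)
```

Sending two records with the same schema costs one registry call on the producer side and one on the consumer side; every later message is served from the local caches.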
Use Pulsar to Stream to Lakehouses
Use Pulsar to Stream from Lakehouses
Why Apache NiFi?
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow-specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Over 300 sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version control
Apache NiFi Pulsar Connector
https://github.com/david-streamlio/pulsar-nifi-bundle
Apache NiFi - Data Lineage / Provenance
Apache NiFi DevOps
https://www.datainmotion.dev/2021/01/automating-starting-services-in-apache.html
https://nipyapi.readthedocs.io/en/latest/
nifi-toolkit/bin/cli.sh nifi list-param-contexts -u http://edge2ai-1.dim.local:8080
nifi-toolkit/bin/cli.sh nifi pg-list -u http://edge2ai-1.dim.local:8080
nifi-toolkit/bin/cli.sh nifi pg-set-param-context -u http://edge2ai-1.dim.local:8080 ...
Apache NiFi DevOps
https://dev.to/tspannhw/automating-starting-services-in-apache-nifi-and-applying-parameters-5h4n
https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh
nifi pg-list
nifi pg-status
nifi pg-get-services
nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root
nifi pg-start -u http://edge2ai-1.dim.local:8080 -pgid LOOKTHISUP
nifi list-param-contexts -u http://edge2ai-1.dim.local:8080 -verbose
nifi create-reporting-task -u http://edge2ai-1.dim.local:8080 -verbose -i
Apache NiFi Toolkit Setup
Download NiFi Toolkit
Copy keystore and truststore information from your NiFi conf/nifi.properties
Create a nifi.properties file linked to the cli.sh
baseUrl=https://nvidia-desktop:8443
keystore=/home/nvidia/nvme/nifi-1.15.3/conf/keystore.p12
keystoreType=PKCS12
keystorePasswd=5325343412efaab3123c6892d93
keyPasswd=53134eee99da9dbe9349123aa17c6892d93
truststore=/home/nvidia/nvme/nifi-1.15.3/conf/truststore.p12
truststoreType=PKCS12
truststorePasswd=93498Dfdjfhujdhure8d8hfd84j3n43jd
Why Apache Flink?
● Unified computing engine
● Batch processing is a special case of stream processing
● Stateful processing
● Massive scalability
● Flink SQL for queries and inserts against Pulsar topics
● Streaming analytics
● Continuous SQL
● Continuous ETL
● Complex event processing
● Standard SQL powered by Apache Calcite
SQL
select aqi, parameterName, dateObserved, hourObserved, latitude, longitude, localTimeZone,
       stateCode, reportingArea
from airquality;

select max(aqi) as MaxAQI, parameterName, reportingArea
from airquality
group by parameterName, reportingArea;

select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount,
       parameterName, reportingArea
from airquality
group by parameterName, reportingArea;
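To see what the last aggregate query computes, it can be run against a small bounded table. The Python sketch below uses the standard library's sqlite3 module as a stand-in engine, which is an assumption made for illustration: Flink SQL would evaluate the query continuously over a Pulsar topic instead. The sample rows are made up, and an order by clause is added only to make the output deterministic.

```python
import sqlite3

# Made-up sample rows for illustration only; not real air quality readings.
ROWS = [
    ("O3", "Region A", 40),
    ("O3", "Region A", 60),
    ("PM2.5", "Region A", 30),
    ("PM2.5", "Region B", 50),
]

QUERY = """
select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount,
       parameterName, reportingArea
from airquality
group by parameterName, reportingArea
order by parameterName, reportingArea
"""


def summarize(rows):
    """Load rows into an in-memory table and run the aggregate query once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table airquality (parameterName text, reportingArea text, aqi integer)"
    )
    conn.executemany("insert into airquality values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

Each output row holds the maximum, minimum, average and count of aqi for one (parameterName, reportingArea) pair.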
Apache Flink - Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores
[Figure: FLiP architecture with StreamNative Hub and StreamNative Cloud. Labels: Streaming Data Gateway; Protocols (Pulsar, KoP, MoP, Websocket, HTTP); Micro Service; Unified Batch and Stream COMPUTING (Batch + Stream); Batch; Unified Batch and Stream STORAGE (Queuing + Streaming); Offload; Tiered Storage; Pulsar Sink; Data to Cloud Data Lake.]
Monitoring and Metrics Check
curl http://localhost:8080/admin/v2/persistent/conf/ete/first/stats | python3 -m json.tool
bin/pulsar-admin topics stats-internal persistent://conf/ete/first
curl http://pulsar1:8080/metrics/
bin/pulsar-admin topics peek-messages --count 5 --subscription ete-reader persistent://conf/ete/first
bin/pulsar-admin topics subscriptions persistent://conf/ete/first
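The stats URL used with curl above follows directly from the topic name. The Python sketch below builds that URL from a fully qualified topic name and, optionally, fetches it. The helper names topic_stats_url and fetch_stats are hypothetical, the default base URL is taken from the curl example, and fetch_stats assumes a broker is actually reachable at that address.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def topic_stats_url(topic: str, base_url: str = "http://localhost:8080") -> str:
    """Map 'persistent://tenant/namespace/topic' to its admin REST stats URL."""
    domain, _, path = topic.partition("://")
    tenant, namespace, name = path.split("/", 2)
    encoded = "/".join(quote(p, safe="") for p in (tenant, namespace, name))
    return f"{base_url.rstrip('/')}/admin/v2/{domain}/{encoded}/stats"


def fetch_stats(topic: str, base_url: str = "http://localhost:8080") -> dict:
    """Fetch and decode topic stats. Requires a running broker at base_url."""
    with urlopen(topic_stats_url(topic, base_url), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL; call fetch_stats(...) when a broker is available.
    print(topic_stats_url("persistent://conf/ete/first"))
```

Building the URL in code rather than by hand avoids mistakes when the same check is scripted across many topics.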
Cleanup
bin/pulsar-admin topics delete persistent://conf/ete/first
bin/pulsar-admin namespaces delete conf/ete
bin/pulsar-admin tenants delete conf
Metrics: Broker
Broker metrics are exposed under "/metrics"
at port 8080.
You can change the port by setting
webServicePort to a different port in the
broker.conf configuration file.
All metrics exposed by a broker are labeled
with cluster=${pulsar_cluster}, where
${pulsar_cluster} is the name of the Pulsar
cluster configured in the broker.conf file.
These metrics are available for brokers:
● Namespace metrics
○ Replication metrics
● Topic metrics
○ Replication metrics
● ManagedLedgerCache metrics
● ManagedLedger metrics
● LoadBalancing metrics
○ BundleUnloading metrics
○ BundleSplit metrics
● Subscription metrics
● Consumer metrics
● ManagedLedger bookie client metrics
For more information: https://pulsar.apache.org/docs/en/reference-metrics/#broker
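Because every broker metric carries a cluster label, the text returned by /metrics can be filtered per cluster with a few lines of code. The Python sketch below parses the Prometheus text exposition format with simple regular expressions. It is a simplified parser written for illustration, and the metric name example_metric and the sample values are made up rather than taken from a real broker.

```python
import re

# name{label="value",...} value   (labels are optional)
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Made-up sample in Prometheus text format; example_metric is hypothetical.
SAMPLE = """\
# TYPE example_metric gauge
example_metric{cluster="standalone",topic="persistent://conf/ete/first"} 3.0
example_metric{cluster="other"} 1.0
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def for_cluster(samples, cluster):
    """Keep only samples whose cluster label equals the given cluster name."""
    return [s for s in samples if s[1].get("cluster") == cluster]


if __name__ == "__main__":
    for name, labels, value in for_cluster(parse_metrics(SAMPLE), "standalone"):
        print(name, labels, value)
```

In practice a Prometheus server would scrape and filter these metrics; a script like this is mainly useful for a quick manual check of one broker's output.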
Let’s Keep in Touch!
Tim Spann
Developer Advocate
@PaaSDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw
All Day DevOps - FLiP Stack for Cloud Data Lakes
