Fast Streaming into Clickhouse with Apache Pulsar

Tim Spann, Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar,
Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps grow the Pulsar community by sharing technical
knowledge and experience at global conferences and through
individual conversations.
https://streamnative.io/pulsar-python/
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
streamnative.io
Agenda
• What is Apache Pulsar? 2.10!
• Code / Demonstration.
Apache Pulsar is a Cloud-Native
Messaging and Event-Streaming Platform.
The right API for async
● Designed for teams, with built-in multi-tenancy
● Power and flexibility, with support for simultaneous streaming and messaging use cases
● Ideal for high-scale, mission-critical microservices
● Easy to use, with a simple pub/sub API
What are the Benefits of Pulsar?
● Data Durability
● Scalability
● Geo-Replication
● Multi-Tenancy
● Unified Messaging Model
Pulsar’s Publish-Subscribe model
[Diagram: Producer 1 and Producer 2 publish to a Topic on a Broker; a Subscription delivers messages to Consumers 1-3]
● Producers send messages.
● Topics are ordered, named channels that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
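The roles above can be sketched as a toy in-memory model in plain Python. This is illustrative only, not the real Pulsar client; the class names here are invented for the sketch:

```python
from collections import deque


class Subscription:
    """A named rule that buffers a topic's messages for its consumers."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def deliver(self, payload):
        self.queue.append(payload)


class Topic:
    """An ordered, named channel; the broker fans each message out to every subscription."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}

    def subscribe(self, sub_name):
        # Re-subscribing with the same name attaches to the existing subscription.
        return self.subscriptions.setdefault(sub_name, Subscription(sub_name))

    def publish(self, payload):
        for sub in self.subscriptions.values():
            sub.deliver(payload)


topic = Topic("persistent://public/default/iotjetsonjson")
analytics = topic.subscribe("analytics")
audit = topic.subscribe("audit")
topic.publish(b"reading-1")  # both subscriptions receive a copy
print(len(analytics.queue), len(audit.queue))
```

Each subscription gets its own copy of the stream, which is why one topic can simultaneously feed messaging and streaming consumers.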
Topics
[Diagram: a Pulsar Instance contains Pulsar Clusters; each Cluster hosts Tenants (Compliance, Data Services, Marketing); Tenants contain Namespaces (Microservices, ETL, Campaigns, Risk Assessment); Namespaces contain Topics (Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection)]
Pulsar subscription modes
Different subscription modes have different semantics:
● Exclusive/Failover - guaranteed order, single active consumer
● Shared - multiple active consumers, no order
● Key_Shared - multiple active consumers, order for a given key
[Diagram: Producer 1 and Producer 2 publish keyed messages to a Pulsar Topic, consumed through four subscriptions:
● Subscription A (Exclusive): Consumer A receives every message.
● Subscription B (Failover): Consumer B-2 takes over in case of failure in Consumer B-1.
● Subscription C (Shared): Consumers C-1 and C-2 each receive an interleaved mix, e.g. <K1,V10> <K2,V21> <K1,V12> and <K2,V20> <K1,V11> <K2,V22>.
● Subscription D (Key_Shared): Consumer D-1 receives all of key K1 (<K1,V10> <K1,V11> <K1,V12>) and Consumer D-2 receives all of key K2 (<K2,V20> <K2,V21> <K2,V22>).]
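The difference between Shared and Key_Shared dispatch can be simulated in a few lines of Python. This is a sketch, not Pulsar's actual dispatcher (the real broker assigns keys to consumers via hash ranges; plain CRC32 modulo is used here only to keep the illustration short):

```python
import itertools
import zlib


def dispatch_shared(messages, consumers):
    """Shared: round-robin across active consumers; per-key order is not preserved."""
    assigned = {c: [] for c in consumers}
    rr = itertools.cycle(consumers)
    for msg in messages:
        assigned[next(rr)].append(msg)
    return assigned


def dispatch_key_shared(messages, consumers):
    """Key_Shared: each key hashes to exactly one consumer, preserving per-key order."""
    assigned = {c: [] for c in consumers}
    for key, value in messages:
        idx = zlib.crc32(key.encode()) % len(consumers)
        assigned[consumers[idx]].append((key, value))
    return assigned


msgs = [("K1", "V10"), ("K2", "V20"), ("K1", "V11"),
        ("K2", "V21"), ("K1", "V12"), ("K2", "V22")]
out = dispatch_key_shared(msgs, ["D-1", "D-2"])
# Every message for a given key lands on a single consumer, in publish order.
print(out)
```

Running this shows why Key_Shared scales out consumers while still giving ordered processing per key, e.g. per device uuid in the IoT demo.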
Pulsar Terminology
Producer is a process that
publishes messages to a topic.
Consumer is a process that
establishes a subscription to a
topic and processes messages
published to that topic.
Subscription: A subscription is a
named configuration rule that
determines how messages are
delivered to consumers.
Brokers handle the connections
and route messages.
Instance is a group of clusters
that act together as a single
unit.
Cluster is a set of Pulsar
brokers, ZooKeeper quorum, and
an ensemble of BookKeeper
bookies.
Tenants are the administrative
unit for allocating capacity and
enforcing an authentication/
authorization scheme.
Namespaces are a grouping
mechanism for related topics.
Topics are named channels for
transmitting messages from
producers to consumers.
Messages belong to a topic and
contain an arbitrary payload.
BookKeeper is the log storage
system that Pulsar uses for
durable storage of all messages.
Bookies store messages and
cursors. Messages are grouped in
segments/ledgers.
ZooKeeper stores metadata for
both Pulsar and BookKeeper, and
also performs service discovery.
Protocol handlers let clients for other messaging systems talk to Pulsar natively:
● Kafka on Pulsar (KoP)
● MQTT on Pulsar (MoP)
● RocketMQ on Pulsar (RoP)
● AMQP on Pulsar (AoP)
Art by Victoria Spann
Moving Data Out of Pulsar to Clickhouse
IO/Connectors are a simple way to integrate with external systems and move data
in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/
● Built on top of Pulsar Functions
● Built-in connectors - hub.streamnative.io
[Diagram: Source → Pulsar → Clickhouse Sink]
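A sink deployment might look like the following sketch. The connection settings (host, database, credentials) are placeholders for this demo; the config keys (jdbcUrl, userName, password, tableName) follow the Pulsar JDBC sink documentation linked above:

```yaml
# clickhouse-sink.yml - JDBC sink configuration (placeholder values)
configs:
  userName: "default"
  password: ""
  jdbcUrl: "jdbc:clickhouse://clickhouse1:8123/default"
  tableName: "iotjetsonjson_local"
```

It can then be deployed with something like `bin/pulsar-admin sinks create --archive connectors/pulsar-io-jdbc-clickhouse-2.X.nar --inputs persistent://public/default/iotjetsonjson --name clickhouse-sink --sink-config-file clickhouse-sink.yml`, after which every message on the input topic is inserted into the Clickhouse table.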
How Companies Use StreamNative Today
StreamNative Cloud
APP Layer: Application Messaging, Data Pipelines, Real-time Contextual Analytics
Microservices: Notification, Dashboard, Risk Control, Auditing, Payment, ETL
Powered by Apache Pulsar, StreamNative provides a cloud-native,
real-time messaging and streaming platform to support multi-cloud
and hybrid cloud strategies.
Built for Containers
Cloud Native
StreamNative Cloud
Flink SQL
Fast Streaming into Clickhouse with Apache Pulsar
Demo
https://github.com/tspannhw/FLiP-Stream2Clickhouse/
Key Takeaways
Clickhouse makes it fast and easy to
insert and query streaming data.
Apache Pulsar is a great option for
event-based streaming & storage.
Pulsar IO Sinks are easy.
StreamNative Hub
StreamNative Cloud
Unified Batch and Stream COMPUTING: Batch (Batch + Stream)
Unified Batch and Stream STORAGE: Offload (Queuing + Streaming)
Apache Flink - Apache Pulsar - Apache NiFi <-> Events <-> Clickhouse
Tiered Storage
[Diagram: Edge Gateway protocol handlers (Pulsar, KoP, MoP, WebSocket, HTTP) feed Pulsar; Pulsar Sinks fan events out to streaming systems and microservices]
End-to-End Streaming FLiP(N) Apps
IoT Data
IoT Ingestion: High-volume streaming sources, sensors, multiple message formats, diverse protocols and multi-vendor devices create data ingestion challenges.
Other Sources: Transit data, news, Twitter, status feeds, REST data, stock data and more.
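The IoT events in this demo are JSON documents whose fields line up with the Clickhouse table created later. A producer-side payload can be sketched as below; the field values are made up for illustration, only the field names come from the demo schema:

```python
import json
import uuid
from datetime import datetime


def build_event(camera="/dev/video0", top1="window screen", top1pct="26.83"):
    """Assemble one JSON-ready event with the field names the demo table expects."""
    now = datetime.now()
    return {
        "uuid": f"xav_uuid_{now:%Y%m%d%H%M%S}_{uuid.uuid4().hex[:8]}",
        "camera": camera,
        "ipaddress": "192.168.1.228",       # placeholder
        "networktime": "25.06",
        "top1pct": top1pct,
        "top1": top1,
        "cputemp": "29.5", "gputemp": "30.0",
        "gputempf": "86", "cputempf": "85",
        "runtime": "6",
        "host": "nvidia-desktop", "host_name": "nvidia-desktop",
        "filename": "/home/nvidia/nvme/images/example.jpg",   # placeholder
        "macaddress": "70:66:55:15:b4:a5",  # placeholder
        "te": "5.92",
        "systemtime": now.strftime("%m/%d/%Y %H:%M:%S"),
        "cpu": "37.2", "diskusage": "24512.6 MB",
        "memory": "77.9",
        "imageinput": "/home/nvidia/nvme/images/in.jpg",      # placeholder
    }


payload = json.dumps(build_event())
# payload would then be sent with a Pulsar producer, e.g. producer.send(payload.encode("utf-8"))
```

Because the JDBC sink maps JSON fields to table columns by name, keeping these field names in sync with the Clickhouse DDL is what makes the pipeline work end to end.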
Build Cluster in Altinity Cloud
https://docs.altinity.com
Streaming Events into Altinity Cloud
https://docs.altinity.com/altinitycloud/quickstartguide/connectclient/
pulsar-io-jdbc-clickhouse-2.X.nar
CREATE TABLE iotjetsonjson_local
(
uuid String, camera String, ipaddress String, networktime String, top1pct String,
top1 String, cputemp String, gputemp String, gputempf String,
cputempf String, runtime String,
host String, filename String, host_name String, macaddress String,
te String, systemtime String, cpu String, diskusage String,
memory String, imageinput String
)
ENGINE = MergeTree()
PARTITION BY uuid
ORDER BY (uuid);
CREATE TABLE iotjetsonjson ON CLUSTER '{cluster}' AS iotjetsonjson_local
ENGINE = Distributed('{cluster}', default, iotjetsonjson_local, rand());
Streaming Events into Altinity Cloud
Clickhouse Altinity Cloud Table
Monitor The Cluster in Altinity Cloud
https://docs.altinity.com
SOURCE CODE
DATE=$(date +"%Y-%m-%d_%H%M")
python3 -W ignore /home/nvidia/nvme/minifi-jetson-xavier/demo.py --camera /dev/video0 --network
googlenet /home/nvidia/nvme/images/$DATE.jpg 2>/dev/null
java -jar /home/nvidia/nvme/minifi-jetson-xavier/IoTProducer.jar --serviceUrl pulsar://pulsar1:6650
--topic 'persistent://public/default/iotjetsonjson' --message "`tail -1
/home/nvidia/nvme/logs/demo1.log`"
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://pulsar1.fios-router.home:4040
Spark context available as 'sc' (master = spark://pulsar1:7077, app id = app-20220204140604-0000).
Spark session available as 'spark'.
Welcome to Spark version 3.2.0
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312)
val dfPulsar = spark.readStream
.format("pulsar")
.option("service.url", "pulsar://localhost:6650")
.option("admin.url", "http://localhost:8080")
.option("topic", "persistent://public/default/iotjetsonjson")
.load()
dfPulsar.printSchema()
root
|-- camera: string (nullable = true)
|-- cpu: double (nullable = false)
|-- cputemp: string (nullable = true)
|-- cputempf: string (nullable = true)
|-- diskusage: string (nullable = true)
|-- filename: string (nullable = true)
|-- gputemp: string (nullable = true)
|-- gputempf: string (nullable = true)
|-- host: string (nullable = true)
|-- host_name: string (nullable = true)
|-- imageinput: string (nullable = true)
|-- ipaddress: string (nullable = true)
|-- macaddress: string (nullable = true)
|-- memory: double (nullable = false)
|-- networktime: double (nullable = false)
|-- runtime: string (nullable = true)
|-- systemtime: string (nullable = true)
|-- te: string (nullable = true)
|-- top1: string (nullable = true)
|-- top1pct: double (nullable = false)
|-- uuid: string (nullable = true)
|-- __key: binary (nullable = true)
|-- __topic: string (nullable = true)
|-- __messageId: binary (nullable = true)
|-- __publishTime: timestamp (nullable = true)
|-- __eventTime: timestamp (nullable = true)
|-- __messageProperties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
val pQuery = dfPulsar.selectExpr("*")
  .writeStream.format("console")
  .option("truncate", "false")
  .start()
Sample console output (one row, reflowed for readability):
camera=/dev/video0, cpu=37.2, cputemp=29.5, cputempf=85, diskusage=24512.6 MB,
filename=/home/nvidia/nvme/images/out_video0_fax_20220204191405.jpg,
gputemp=30.0, gputempf=86, host=nvidia-desktop, host_name=nvidia-desktop,
imageinput=/home/nvidia/nvme/images/img_video0_cok_20220204191405.jpg,
ipaddress=192.168.1.228, macaddress=70:66:55:15:b4:a5, memory=77.9,
networktime=25.066272735595703, runtime=6, systemtime=02/04/2022 14:14:10,
te=5.91699481010437, top1=window screen, top1pct=26.8310546875,
uuid=xav_uuid_video0_jfh_20220204191405, __key=[78 61 76 ... 30 35] (the uuid as bytes),
__topic=persistent://public/default/iotjetsonjson, __messageId=[08 D1 A0 05 10 00 20 00],
__publishTime=2022-02-04 14:14:16.517, __eventTime=null, __messageProperties={}
CREATE TABLE iotjetsonjson
(
`id` STRING, uuid STRING, ir STRING,
`end` STRING, lux STRING, gputemp STRING,
cputemp STRING, `te` STRING, systemtime STRING, hum STRING,
memory STRING, gas STRING, pressure STRING,
`host` STRING, diskusage STRING, ipaddress STRING, macaddress STRING,
gputempf STRING, host_name STRING, camera STRING, filename STRING,
`runtime` STRING, cpu STRING,cputempf STRING, imageinput STRING,
`networktime` STRING, top1 STRING, top1pct STRING,
publishTime TIMESTAMP(3) METADATA,
WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
) WITH (
'connector' = 'pulsar',
'topic' = 'persistent://public/default/iotjetsonjson',
'value.format' = 'json',
'scan.startup.mode' = 'earliest',
'service-url' = 'pulsar://pulsar1:6650',
'admin-url' = 'http://pulsar1:8080'
);
bin/pulsar sql
show tables in pulsar."public/default";
select * from pulsar."public/default".iotjetsonjson
order by systemtime desc;
Q&A
[Webinar] Building Microservices - Watch Now
[Blog post] Event-Driven Microservices - Learn More
Now Available: On-Demand Pulsar Training - Academy.StreamNative.io
Let’s Keep in Touch!
Tim Spann
Developer Advocate
@PassDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw