Pulsar Virtual Summit North America 2021
Why Micro Focus Selected Pulsar for Data Ingestion
Srikanth Natarajan
Fellow, CTO, IT Operations Management Product Group, Micro Focus
Speaker Bio
Srikanth Natarajan is an experienced technology leader in the IT
Operations Management software industry, a Micro Focus Fellow and
CTO for the ITOM Product Group, and a former HP Distinguished
Technologist. He has engineered many successful products, led multiple
architectural transformations of products, and most recently, led the
transformation of a major product portfolio into a modern
containerized/cloud native architecture. He has also been responsible for
the recent market introduction of the OPTIC Data Lake from Micro
Focus. He has been granted over 20 patents for his contributions to
various inventions. He lives in Fort Collins, Colorado.
Agenda
This session will cover the experience of Micro Focus in consuming from and contributing to Apache Pulsar, the lessons learned, and the collaboration with a development support partner that helped us along the journey.
Micro Focus IT Operations Management (ITOM) Context
• Large portfolio of operations management products generating many forms of data
• Needed to process and store a variety of operational data in near real time
• Data is primarily time series, structured and semi-structured
• Data needed to be processed both in motion and at rest
• Micro Focus has Vertica1 technology for long-term storage/analytics but also needed a streaming engine for real-time transport and analysis of data
1https://www.vertica.com/
Our Technical Requirements for a Streaming Engine
• Enterprise/SaaS ready
• Scalable, multi-tenant, durable, extensible, and easy to productize
• Observable
• Low latency and high throughput across a variety of data
• Tiered storage support
• Easy to deploy and operate in production as containers in a Kubernetes cluster without any professional services support
• Ready to use, simple to deploy, and easy to operate both in the cloud and on-prem
Apache Pulsar provided us a great start. We integrated it with Vertica and created the Micro Focus OPTIC2 Data Lake.
2https://community.microfocus.com/it_ops_mgt/b/sws-571/posts/announcing-optic---the-operations-platform-for-transformation-intelligence-and-cloud
High Level Architecture of Our OPTIC Data Lake
[Architecture diagram: the streaming components run in a Kubernetes Cluster alongside a Vertica Cluster]
• (1) Data Input (streaming) enters through the Streaming Pipeline; (2) Batch Input (Express Load) enters through the Express Pipeline, with bulk load (S3 based) into an Object Store (Amazon S3 compatible)
• Messaging Bus (Pulsar based) providing brokering and storage: Scalable | Durable | Available | Observable | Multi-tenancy | ...
• Data Processing (Flink based), processing data in motion: Baselining, Forecasting, Aggregation, Advanced Event Correlation; Scalable | Extensible | Distributed
• Vertica Cluster, processing data at rest: Big Data | Analytics | Database, with BI Tools on top
• REST API Layer for data access, used by ITOM Capabilities (ITOM internal)
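The messaging bus above relies on Pulsar's multi-tenancy and on tiered storage into the S3-compatible object store. As a minimal sketch of what that kind of namespace configuration can look like with the Pulsar Java admin client (ours for illustration, not the actual OPTIC setup; the namespace name, retention values, and offload threshold are assumptions):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class NamespaceSetup {
    public static void main(String[] args) throws Exception {
        // Hypothetical admin endpoint; the tenant "itom" is assumed to already exist.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker:8080")
                .build()) {

            String namespace = "itom/metrics";            // one namespace per data domain
            admin.namespaces().createNamespace(namespace);

            // Retain acknowledged data for up to 24 hours or 10 GB (illustrative values).
            admin.namespaces().setRetention(namespace, new RetentionPolicies(24 * 60, 10 * 1024));

            // Offload ledgers beyond ~1 GB per topic to the S3-compatible object store
            // (tiered storage, illustrative threshold).
            admin.namespaces().setOffloadThreshold(namespace, 1024L * 1024 * 1024);
        }
    }
}
```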
Vertica Ingestion Micro Service
[Diagram components: Data Sources, HTTP Receiver, Pulsar Client, Config Client, Pulsar Proxy, Broker, BookKeeper, ZooKeeper, Administration, Vertica Scheduler, Vertica]
1. Configure streaming
2. Create topic and subscription
3. Stream data
4. Push configuration
5. Get backlog
6. Invoke COPY (load) command
7. Read messages (Reader API) and store in DB
8. Send load status
9. Update cursor of subscription
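As a rough illustration of steps 2 and 3 above, here is a minimal sketch using the Pulsar Java admin and client APIs; the service URLs, topic, and subscription names are assumptions for illustration, not the actual OPTIC configuration.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class StreamingSetup {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://itom/metrics/opsb-metrics";   // illustrative topic name

        // Step 2: create the topic and a durable subscription whose cursor can be advanced later.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker:8080").build()) {
            admin.topics().createNonPartitionedTopic(topic);
            admin.topics().createSubscription(topic, "vertica-loader", MessageId.earliest);
        }

        // Step 3: the HTTP receiver side streams incoming payloads onto the topic via a producer.
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy:6650").build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic).create()) {
            producer.send("{\"metric\":\"cpu.util\",\"value\":42}");   // illustrative payload
        }
    }
}
```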
Scheduler Message Streaming Overview
[Diagram components: Producer 1..n (Data Collectors), Receiver, Pulsar Message Bus, Administration Service, Scheduler for Pulsar Streaming, and a Vertica 3-node cluster with a UDx (reader) on each node]
0. Get config
1. Message ingestion
2. Get backlog
3. The scheduler periodically (frame by frame) schedules the micro-batch (µB) COPY commands and asks the UDx to read the messages from Pulsar for a topic

Pulsar readers are message processors much like Pulsar consumers, but with two crucial differences:
• you can specify where on a topic readers begin processing messages (consumers always begin with the latest available unacked message);
• readers don't retain data or acknowledge messages.

UDx: User-Defined Extensions
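A minimal sketch of the Reader API behaviour described above, using the Pulsar Java client; the topic and service URL are illustrative assumptions, and this is not the actual UDx code, which runs inside Vertica.

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import java.util.concurrent.TimeUnit;

public class BacklogReader {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy:6650").build();
             // Unlike a consumer, a reader is told exactly where to start on the topic
             // and never acknowledges anything; the caller tracks its own position.
             Reader<byte[]> reader = client.newReader()
                .topic("persistent://itom/metrics/opsb-metrics")
                .startMessageId(MessageId.earliest)       // or a previously stored cursor position
                .create()) {

            Message<byte[]> msg;
            while ((msg = reader.readNext(1, TimeUnit.SECONDS)) != null) {
                // In the real pipeline this batch would be handed to a Vertica COPY micro-batch.
                System.out.println(msg.getMessageId() + " -> " + msg.getData().length + " bytes");
            }
        }
    }
}
```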
Use Case: Event Correlation Using Pulsar and Flink
[Diagram: Data Sources send raw events through an HTTP Receiver into the Raw Event Topic on Apache Pulsar; a Pulsar Source Connector feeds a Flink Task Manager running Auto Event Correlation (coordinated by the Flink Job Manager and using ML Artifacts); correlated events flow through a Pulsar Sink Connector into the Correlated Event Topic, which is consumed by the Internal Notification Service and Micro Focus Operations Bridge Manager (Event Manager); raw and correlated events are also stored in the Vertica Database]
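In the deployed pipeline this topology is a Flink job wired up through the Pulsar Flink source and sink connectors. As a connector-free sketch of the topic topology only, the snippet below uses the plain Pulsar Java client; the topic names, subscription name, and the correlation step are all assumptions for illustration, not the actual correlation logic.

```java
import org.apache.pulsar.client.api.*;

public class CorrelationSketch {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy:6650").build();
             Consumer<String> rawEvents = client.newConsumer(Schema.STRING)
                .topic("persistent://itom/events/raw-events")            // illustrative topic
                .subscriptionName("auto-event-correlation")              // illustrative subscription
                .subscribe();
             Producer<String> correlated = client.newProducer(Schema.STRING)
                .topic("persistent://itom/events/correlated-events")     // illustrative topic
                .create()) {

            while (true) {
                Message<String> msg = rawEvents.receive();
                // Placeholder for the Flink auto-correlation operator, which groups
                // related raw events using the trained ML artifacts.
                String correlatedEvent = "correlated:" + msg.getValue();
                correlated.send(correlatedEvent);
                rawEvents.acknowledge(msg);
            }
        }
    }
}
```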
Our Results
● We verified in our lab ingesting streams at approximately 100 MB/s, i.e. 360 GB/hr or 8.64 TB/day, with our default setup of Kubernetes worker nodes
● We plan to double this in the near future
● With linear scalability, we should be able to support much more with additional resources added proportionally
Support from StreamNative
● Issues resolved across different versions of the connector and Pulsar
○ Data loss observed when using the Pulsar Flink connector (three releases had this issue, with multiple root causes)
○ Resolution for thread leaks in the Pulsar Flink connector
○ Flink connector stops streaming if a topic is recreated or when state is restored in Backup/Restore and DR scenarios
○ Security fixes for all the STAT issues observed (last two releases, with quick turnaround)
○ NullPointerException in the Flink connector
● Additional help/tools
○ Dynamic linking of libraries in the Pulsar C++ client
○ Multiple CA certificate support in Pulsar
○ Formula to compute storage given ingestion rate (see the sizing sketch below)
○ Performance tuning across cloud and on-prem deployments
○ State migration utility from StreamNative helped us migrate while upgrading from Flink 1.9 to Flink 1.11
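The exact formula StreamNative provided is not reproduced in these slides. As a back-of-the-envelope sketch of the kind of sizing it covers, the snippet below estimates BookKeeper storage from ingestion rate, retention window, and write quorum; every input value is an assumption, not an OPTIC or StreamNative number.

```java
public class StorageEstimate {
    public static void main(String[] args) {
        // All inputs are illustrative assumptions.
        double ingestionMBps = 100;   // sustained ingestion rate, as on the results slide
        double retentionHours = 24;   // how long data stays in BookKeeper before offload/expiry
        int writeQuorum = 2;          // copies of each entry kept on bookies
        double overhead = 1.3;        // journal, indexing, and compaction headroom

        double storageGB = ingestionMBps * 3600 * retentionHours * writeQuorum * overhead / 1024;
        System.out.printf("Estimated BookKeeper storage: %.0f GB%n", storageGB);
    }
}
```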
Current Open Issues
● NullPointerException in the broker; workaround provided, yet to validate
● Backup/Restore and DR use case not working when using Flink connector 2.4.28.4
● ER: Different TLS certificate configuration for geo-replication; PR already created by Sijie (https://github.com/apache/pulsar/pull/10710)

Note: In the session video recording, there was a reference to a Micro Focus internal page that contains the issues listed above. It was recorded in error. Please ignore that aspect when you listen to the recording.
Thank You.
