Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Building a Real-Time Analytics
Application with
Apache Pulsar and Apache Pinot
Mark Needham
@MarkHNeedham
15th November 2022
Mary Grygleski
@mgrygles

Mary Grygleski
The Passionate Developer Advocate
Mary is a Streaming Developer Advocate at DataStax, a
leading Data Management Company that specializes in
Database-as-a-Service, NoSQL, Big Data, Streaming, and
the Cloud-Native platform. Previously she was with the
Java and WebSphere/Open Source Advocacy team at
IBM.
Based out of Chicago, Mary is a Java Champion and
President and Executive Board Member of the Chicago
Java Users Group (CJUG). She is also co-organizers for
the Data, Cloud and AI In Chicago, Chicago Cloud, and
IBM Cloud Chicago meetup groups.
She has extensive experience in product and application
design, development, integration, and deployment
experience, and specializes in Event-driven, Reactive
Java, Open Source, and Cloud-enabled Distributed
systems.
https://www.linkedin.com/in/mary-grygleski/
@mgrygles
https://www.twitch.tv/mgrygles
https://discord.gg/RMU4Juw
Who is Mary?

Mark Needham
Developer Relations Engineer
Mark Needham is an Apache Pinot advocate and
developer relations engineer at StarTree.
As a developer relations engineer, Mark helps users
learn how to use Apache Pinot to build their real-time
user-facing analytics applications. He also does
developer experience, simplifying the getting started
experience by making product tweaks and
improvements to the documentation.
Mark writes about his experiences working with Pinot at
markhneedham.com.
https://www.linkedin.com/in/markhneedham/
@markhneedham
Who is Mark?
https://www.markhneedham.com/blog/
learndatawithmark.com

What is Real-Time Analytics?
Real-time analytics is the discipline that applies logic and mathematics
to data to provide insights for making better decisions quickly.

Events -> Insight
Events Insight

Events -> Insight -> Action
Events Insight Action

The value of data over time
Time
Value

Time
Value
Real-Time

Time
Value
Real-Time
Who’s interested in this data?
● Analysts
● Management
● Users

Real-Time Analytics Quadrant
Human Facing
Machine Facing
Internal External
Observability
Real-Time
Dashboard
Recommendation Engine
Fraud Detection
Order Tracking Service

Total users 700 Million
QPS 10000+
Latency SLA < 100 ms p99th
Freshness Seconds
Examples of Real-Time Analytics

Missed
orders
Inaccurate
orders
Downtime
Top selling
items
Menu item
Feedback
Total users 500,000+
QPS 100s
Latency SLA < 100 ms p99th
Freshness Seconds - Minutes

Source:
Peter Bakkum, Engineering Manager @Stripe Financial

Properties of Real-Time Analytics Systems

Building a User-facing Real-Time Analytics System
Velocity of
ingestion
Real-Time
Ingestion
1000s of QPS
Milliseconds
Latency
Seconds
Freshness
Highly
Available Scalable
Cost
Effective
High
Dimensionality

18
Open source
Created by Yahoo
Contributed to the Apache Software Foundation (ASF) in 2016
Top-level project (2018)
Cloud-native design
Cluster based
Multi-tenant
Simple client APIs (Java, C#, Python, Go, …)
➔ Separate compute and storage!
Guaranteed message delivery
If a message successfully reaches a Pulsar broker, it will be delivered to its
intended target.
Light-weight serverless functions framework
Create complex processing logic within a Pulsar cluster (aka: data
pipeline)
Tiered storage oﬄoads
Oﬄoad data from hot/warm storage to cold/long-term storage when the
data is aging out
Meet
Pulsar

19
Streaming
Ingest data Sink data Select data
Process data
Not Streaming
Ingest
data
Persist
data
Select
data
Process
data
Streaming versus not streaming
Persist
data
Select
data

S1 S3
Pinot
Controller
S2
3
1 2
2 3
4
Pinot Servers
Zookeeper
Pinot
Broker
S4
4
1
Seg1 -> S1
Seg2 -> S2
Seg3 -> S3
Seg4 -> S4
Seg1 -> S1, S4
Seg2 -> S2, S3
Seg3 -> S3, S1
Seg4 -> S4, S2
select count(*) from X
where country = us
Apache Pinot Architecture

github.com/mneedham/pinot-wiki/tree/pulsar

Our data set: Wikimedia Recent Changes Feed
● A continuous stream of structured event data
describing changes made to Wikimedia properties.
● Published over HTTP using the Server-Side Events
(SSE) Protocol.

Wikimedia Recent Changes Feed events
event: message
id:
[{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1647344554001},{"topic":"codfw.me
diawiki.recentchange","partition":0,"offset":-1}]
data:
{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://en.wikipedia.org/wiki/Bosmansdam_H
igh_School","request_id":"f72015bb-376c-48b9-9863-afc0c75a72c8","id":"99c272ae-d31c-4535-9dac-69b098
3171d6","dt":"2022-03-15T11:42:34Z","domain":"en.wikipedia.org","stream":"mediawiki.recentchange","t
opic":"eqiad.mediawiki.recentchange","partition":0,"offset":3714501013},"id":1485381286,"type":"edit
","namespace":0,"title":"Bosmansdam High School","comment":"v2.04b - Fix errors for [[WP:WCW|CW
project]] (Template value ends with break)","timestamp":1647344554,"user":"ZI
Jony","bot":false,"minor":true,"length":{"old":16089,"new":16085},"revision":{"old":1075262250,"new"
:1077261343},"server_url":"https://en.wikipedia.org","server_name":"en.wikipedia.org","server_script
_path":"/w","wiki":"enwiki","parsedcomment":"v2.04b - Fix errors for <a href="/wiki/Wikipedia:WCW"
class="mw-redirect" title="Wikipedia:WCW">CW project</a> (Template value ends with break)"}

Powered by Apache Pinot
3.9k
Github Stars
Slack Users
Companies
2400+
100+
Community
Events/sec
1M+ Peak QPS
200k+ Query Latency
ms
Performance
pinot.apache.org

Takeaways
● Real-time analytics lets us create applications that give users
actionable insights
● Properties of these systems: Fresh data, fast querying, at scale
● Pulsar + Pinot is the perfect combination to achieve this

Thank you! (from Mark) 🙇
dev.startree.ai
@MarkHNeedham
stree.ai/slack
@learndatawithmark

Thank you! (from Mary) 󰢚
@mgrygles
Apache Pulsar Slack sign-up
https://apache-pulsar.herokuapp.com/
https://pulsar-neighborhood.github.io/

Resources
Astra DB: https://astra.datastax.com
Astra Streaming:
https://www.datastax.com/products/astra-streaming
Luna Streaming:
https://www.datastax.com/products/luna-streaming
CDC for Astra DB:
https://docs.datastax.com/en/astra/docs/astream-cdc.html
https://pulsar.apache.org/
https://bookkeeper.apache.org/
https://zookeeper.apache.org

Check out 5 Minutes About Pulsar on
https://bit.ly/3bgkRxJ

How to start coding ?
Check out Awesome-Astra
https://awesome-astra.github.io/docs/

Follow Mary’s Twitch Stream
(Different topics: Java, Open Source, Distributed Messaging, Event-Streaming, Cloud, DevOps, etc)
Wednesday at 2pm-US/CST
https://twitch.tv/mgrygles

Creating Pinot Table
docker exec -it pinot-controller-wiki bin/pinot-admin.sh
AddTable
-tableConfigFile /config/table.json
-schemaFile /config/schema.json
-exec

Streamlit Dashboard: Top Users

Streamlit Dashboard: Top Bots/Non Bots

Streamlit Dashboard: What got changed?

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Similar to Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot (20)

More from Altinity Ltd

More from Altinity Ltd (20)

Recently uploaded

Recently uploaded (20)

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot