This document provides an overview of building a real-time analytics application with Apache Pulsar and Apache Pinot. It introduces Mary Grygleski and Mark Needham, describes what real-time analytics is, and discusses the properties of real-time analytics systems. It then demonstrates how to ingest data from the Wikimedia recent changes feed into Pulsar and Pinot for real-time analytics and builds a dashboard with the data using Streamlit.
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
1. Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Mark Needham
@MarkHNeedham
15th November 2022
Mary Grygleski
@mgrygles
2. Mary Grygleski
The Passionate Developer Advocate
Mary is a Streaming Developer Advocate at DataStax, a
leading Data Management Company that specializes in
Database-as-a-Service, NoSQL, Big Data, Streaming, and
the Cloud-Native platform. Previously she was with the
Java and WebSphere/Open Source Advocacy team at
IBM.
Based out of Chicago, Mary is a Java Champion and
President and Executive Board Member of the Chicago
Java Users Group (CJUG). She is also a co-organizer of
the Data, Cloud and AI In Chicago, Chicago Cloud, and
IBM Cloud Chicago meetup groups.
She has extensive experience in product and application
design, development, integration, and deployment,
and specializes in Event-driven, Reactive
Java, Open Source, and Cloud-enabled Distributed
systems.
https://www.linkedin.com/in/mary-grygleski/
@mgrygles
https://www.twitch.tv/mgrygles
https://discord.gg/RMU4Juw
Who is Mary?
3. Mark Needham
Developer Relations Engineer
Mark Needham is an Apache Pinot advocate and
developer relations engineer at StarTree.
As a developer relations engineer, Mark helps users
learn how to use Apache Pinot to build their real-time
user-facing analytics applications. He also works on
developer experience, simplifying the getting-started
experience through product tweaks and
improvements to the documentation.
Mark writes about his experiences working with Pinot at
markhneedham.com.
https://www.linkedin.com/in/markhneedham/
@markhneedham
Who is Mark?
https://www.markhneedham.com/blog/
learndatawithmark.com
4. What is Real-Time Analytics?
Real-time analytics is the discipline that applies logic and mathematics
to data to provide insights for making better decisions quickly.
16. Building a User-facing Real-Time Analytics System
● Velocity of ingestion
● Real-Time Ingestion
● 1000s of QPS
● Milliseconds Latency
● Seconds Freshness
● Highly Available
● Scalable
● Cost Effective
● High Dimensionality
18. Meet Pulsar
● Open source: created by Yahoo and open-sourced in 2016, contributed to the Apache Software Foundation (ASF) in 2017, top-level project since 2018
● Cloud-native design: cluster based, multi-tenant, simple client APIs (Java, C#, Python, Go, …), separate compute and storage
● Guaranteed message delivery: if a message successfully reaches a Pulsar broker, it will be delivered to its intended target
● Light-weight serverless functions framework: create complex processing logic within a Pulsar cluster (a data pipeline)
● Tiered storage offloads: offload data from hot/warm storage to cold/long-term storage when the data is aging out
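As a rough illustration of the "simple client APIs" point, the sketch below publishes JSON events through the Python client. It assumes the `pulsar-client` package and a broker on `pulsar://localhost:6650`; the topic name `wiki-events` and the helpers `publish`, `publish_to_pulsar`, and `FakeProducer` are hypothetical and not taken from the talk.

```python
import json


def publish(producer, events):
    """Send each event as UTF-8 JSON and return how many were sent.

    `producer` is any object with a send(bytes) method, such as the
    producer returned by pulsar.Client.create_producer.
    """
    sent = 0
    for event in events:
        producer.send(json.dumps(event).encode("utf-8"))
        sent += 1
    return sent


def publish_to_pulsar(events, service_url="pulsar://localhost:6650", topic="wiki-events"):
    """Connect to a broker and publish. Requires `pip install pulsar-client`.

    The service URL and topic name are assumptions for a local setup.
    """
    import pulsar  # imported here so the rest of the sketch runs without it

    client = pulsar.Client(service_url)
    try:
        return publish(client.create_producer(topic), events)
    finally:
        client.close()


class FakeProducer:
    """Stand-in used to exercise publish() without a running broker."""

    def __init__(self):
        self.messages = []

    def send(self, payload):
        self.messages.append(payload)


fake = FakeProducer()
count = publish(fake, [{"type": "edit", "wiki": "enwiki"}])
print(count, fake.messages[0])
```

Keeping the serialization loop separate from the connection code means it can be exercised without a cluster; `publish_to_pulsar` is the only part that needs a running broker.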
19. Streaming versus not streaming
[Figure: two pipelines compared. Streaming labels: ingest data, sink data, select data, process data. Not streaming labels: ingest data, persist data, select data, process data.]
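One possible reading of this comparison, sketched in plain Python: the non-streaming path persists everything before selecting and processing, while the streaming path processes each event as it is ingested and keeps an up-to-date result. The sample events and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample events: (user, is_bot)
events = [("alice", False), ("bot1", True), ("alice", False), ("bob", False)]


def not_streaming(events):
    """Ingest and persist everything first, then select and process."""
    stored = list(events)                          # ingest + persist
    selected = [e for e in stored if not e[1]]     # select human edits
    return Counter(user for user, _ in selected)   # process


def streaming(events):
    """Process each event as it is ingested; yield the running result."""
    counts = Counter()
    for user, is_bot in events:    # ingest one event at a time
        if not is_bot:
            counts[user] += 1      # process in flight
        yield dict(counts)         # a snapshot is available after every event


snapshots = list(streaming(events))
print(snapshots[0], snapshots[-1], dict(not_streaming(events)))
```

Both paths end with the same counts; the difference is that the streaming path has an answer after the first event rather than only after the whole batch is stored.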
26. Our data set: Wikimedia Recent Changes Feed
● A continuous stream of structured event data
describing changes made to Wikimedia properties.
● Published over HTTP using the Server-Sent Events
(SSE) protocol.
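To make the protocol concrete, here is a minimal sketch of an SSE parser that works on any iterable of text lines, such as the lines of a streamed HTTP response body. The function name `parse_sse` and the shortened sample event are assumptions for illustration. A real client would also handle reconnection and the `retry` field, which this sketch ignores.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines.

    An event is a run of "field: value" lines ended by a blank line.
    Multiple "data" lines are joined with newlines.
    """
    event: dict = {}
    data_parts: list = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: emit the event collected so far, if it has data.
            if data_parts:
                event["data"] = "\n".join(data_parts)
                yield event
            event, data_parts = {}, []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional leading space
        if field == "data":
            data_parts.append(value)
        else:
            event[field] = value


# Shortened, hypothetical sample in the shape of a recent-change event.
sample = [
    "event: message\n",
    'id: [{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]\n',
    'data: {"type":"edit","wiki":"enwiki","bot":false}\n',
    "\n",
]
events = list(parse_sse(sample))
change = json.loads(events[0]["data"])
print(events[0]["event"], change["wiki"], change["bot"])
```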
27. Wikimedia Recent Changes Feed events
event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1647344554001},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://en.wikipedia.org/wiki/Bosmansdam_High_School","request_id":"f72015bb-376c-48b9-9863-afc0c75a72c8","id":"99c272ae-d31c-4535-9dac-69b0983171d6","dt":"2022-03-15T11:42:34Z","domain":"en.wikipedia.org","stream":"mediawiki.recentchange","topic":"eqiad.mediawiki.recentchange","partition":0,"offset":3714501013},"id":1485381286,"type":"edit","namespace":0,"title":"Bosmansdam High School","comment":"v2.04b - Fix errors for [[WP:WCW|CW project]] (Template value ends with break)","timestamp":1647344554,"user":"ZI Jony","bot":false,"minor":true,"length":{"old":16089,"new":16085},"revision":{"old":1075262250,"new":1077261343},"server_url":"https://en.wikipedia.org","server_name":"en.wikipedia.org","server_script_path":"/w","wiki":"enwiki","parsedcomment":"v2.04b - Fix errors for <a href=\"/wiki/Wikipedia:WCW\" class=\"mw-redirect\" title=\"Wikipedia:WCW\">CW project</a> (Template value ends with break)"}
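Analytics queries are usually easier against flat rows than nested JSON. The sketch below flattens an abbreviated copy of the event above into a single-level record. The output column names (`ts`, `bytes_changed`, and so on) and the helper `to_row` are hypothetical choices for illustration, not a schema from the talk.

```python
import json

# Abbreviated copy of the event shown above (only the fields used below).
raw = (
    '{"meta":{"id":"99c272ae-d31c-4535-9dac-69b0983171d6",'
    '"domain":"en.wikipedia.org","dt":"2022-03-15T11:42:34Z"},'
    '"type":"edit","title":"Bosmansdam High School","timestamp":1647344554,'
    '"user":"ZI Jony","bot":false,"minor":true,'
    '"length":{"old":16089,"new":16085},"wiki":"enwiki"}'
)


def to_row(event: dict) -> dict:
    """Flatten one recent-change event into a single-level row."""
    length = event.get("length") or {}
    old, new = length.get("old"), length.get("new")
    return {
        "id": event["meta"]["id"],
        "domain": event["meta"]["domain"],
        "type": event["type"],
        "title": event["title"],
        "user": event["user"],
        "bot": event["bot"],
        # The event's timestamp is in seconds; convert to epoch milliseconds.
        "ts": event["timestamp"] * 1000,
        # Difference between new and old length; None when either is missing.
        "bytes_changed": (new - old) if old is not None and new is not None else None,
    }


row = to_row(json.loads(raw))
print(row["title"], row["ts"], row["bytes_changed"])
```

For the sample event this gives a length change of -4 (16085 minus 16089), matching the minor template fix described in the comment field.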
32. Takeaways
● Real-time analytics lets us create applications that give users
actionable insights
● Properties of these systems: Fresh data, fast querying, at scale
● Pulsar + Pinot is the perfect combination to achieve this