http://blog.embian.com/74
Pulsar – an open-source, real-time analytics platform and stream processing framework. Pulsar can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid.
2. 2
Agenda
1. What is Pulsar?
2. Twitter stream processing demo
3. Key points
4. Other platforms
3. 3
1. What is Pulsar?
● Developed by eBay
● Real-time analytics platform
● Stream processing framework
● Scalability
– Scale to tens of millions of events per second
● Availability
– No downtime during software upgrades, stream processing rule
updates, or topology changes
● Flexibility
– SQL-like language and annotations for defining stream
processing rules
4. 4
Pulsar's Building Block
(basic framework)
● Jetstream
– Real-time stream processing framework
– Spring IoC (Inversion of Control) container
6. 6
Pulsar's Building Block (Cont.)
(basic framework)
● Jetstream's key points
– CEP capabilities through Esper integration.
– Define processing logic in SQL
– Extend SQL functionality and pipeline flow routing using SQL
– Hot deploy SQL without restarting applications
– Spring IoC enabling dynamic topology changes at runtime
– Clustering with elastic scaling
– Cloud deployment
7. 7
Pulsar Real-time Analytics Pipeline
● Collector : Ingests events through a REST endpoint
● Sessionizer : Sessionizes the events, maintaining the session state and generating marker events
● Distributor : Filters and mutates events to different consumers; acts as an event router
● Metrics Calculator : Calculates metrics by various dimensions and persists them in the metrics store
● Replay : Replays events that failed in other stages
● ConfigApp : Configures dynamic provisioning for the whole pipeline
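The Distributor's role (filter, mutate, route) can be sketched in a few lines of Python. This is only an illustration of the idea, not Pulsar code: `Route`, `distribute`, the consumer names, and the event fields are all hypothetical.

```python
# Illustrative sketch only: Route, distribute() and the consumer names
# are hypothetical and not part of Pulsar's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    consumer: str                     # name of the downstream consumer
    accepts: Callable[[dict], bool]   # filter: should this consumer see the event?
    mutate: Callable[[dict], dict]    # mutation applied before delivery


def distribute(event, routes):
    """Return {consumer: mutated_event} for every route whose filter matches."""
    return {r.consumer: r.mutate(event) for r in routes if r.accepts(event)}


routes = [
    # Every event goes to "metrics", reduced to the fields it needs.
    Route("metrics", lambda e: True, lambda e: {"country": e["geo"]}),
    # Only events flagged as bots go to "fraud", unchanged.
    Route("fraud", lambda e: e.get("bot", False), lambda e: e),
]

out = distribute({"geo": "KR", "bot": False}, routes)
print(out)  # {'metrics': {'country': 'KR'}}
```

In Pulsar itself this logic is expressed declaratively in the SQL-like language rather than in application code.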
8. 8
1) Collector
● Supports REST API to ingest events
● Geo and device classification enrichment
● Detects fraud and bots
● Streams the enriched events to Sessionizer stage
PulsarRawEvent:A
"si": "UUID",
"ipv4": "ip",
...
"itmP": "itmPrice",
"capQ": "campaignQuantity"
↓ Enrichment
PulsarEvent:A
"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"
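The enrichment step on this slide can be sketched in Python as wrapping the raw event with derived geo and device information. This is a minimal illustration under stated assumptions, not the Collector's actual code: `enrich`, `lookup_geo`, `classify_device`, and the `ua` field are hypothetical, and the lookups are stubs.

```python
# Illustrative sketch only: enrich(), lookup_geo() and classify_device()
# are hypothetical names, not part of Pulsar's API.

def lookup_geo(ip):
    # Stand-in for a real IP-to-location lookup (assumption).
    return {"ip": ip, "country": "unknown"}


def classify_device(user_agent):
    # Stand-in for a real device classifier (assumption).
    return {"userAgent": user_agent, "type": "unknown"}


def enrich(raw_event):
    """Wrap a raw event with geo and device info, keeping the raw payload."""
    return {
        "device": classify_device(raw_event.get("ua", "")),  # "ua" is an assumed field
        "geo": lookup_geo(raw_event.get("ipv4", "")),
        "raw": raw_event,
    }


raw = {"si": "UUID", "ipv4": "10.0.0.1", "itmP": "9.99", "capQ": "3"}
event = enrich(raw)
print(sorted(event))  # ['device', 'geo', 'raw']
```

The raw event is carried along unchanged under `raw`, mirroring the slide, so later stages can still read the original fields.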
9. 9
2) Sessionizer
● Temporal grouping of events that share a specific
identifier; the grouping window is referred to as the
session duration
● Session metadata and state
● Session store (in-memory cache)
PulsarEvent:A
"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"
↓ Sessionization
PulsarEvent:A
"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"
Metadata:A
sessionId,
PageId,
geo-loc,
device,
etc.
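A minimal Python sketch of sessionization as described on this slide: events carrying the same identifier are grouped into a session until the gap between them exceeds the session duration, with state held in an in-memory store. The `Sessionizer` class, the `ts` field, the 30-minute duration, and the metadata fields are assumptions for illustration; Pulsar's real Sessionizer also expires sessions and emits marker events, which this sketch does not attempt.

```python
# Illustrative sketch only: gap-based sessionization with an in-memory store.
# Sessionizer, SESSION_DURATION and the event fields are assumptions.
import uuid

SESSION_DURATION = 30 * 60  # seconds of inactivity that ends a session (assumed)


class Sessionizer:
    def __init__(self, duration=SESSION_DURATION):
        self.duration = duration
        self.store = {}  # session key -> session metadata (in-memory cache)

    def process(self, event):
        """Attach session metadata to an event; start a new session after a gap."""
        key, ts = event["si"], event["ts"]
        session = self.store.get(key)
        if session is None or ts - session["lastSeen"] > self.duration:
            # First event for this key, or the previous session has lapsed.
            session = {"sessionId": str(uuid.uuid4()), "start": ts, "eventCount": 0}
            self.store[key] = session
        session["lastSeen"] = ts
        session["eventCount"] += 1
        return {**event, "metadata": dict(session)}


s = Sessionizer(duration=1800)
a = s.process({"si": "user-1", "ts": 0})
b = s.process({"si": "user-1", "ts": 600})    # within 30 minutes: same session
c = s.process({"si": "user-1", "ts": 5000})   # gap over 30 minutes: new session
```

Here `a` and `b` share a `sessionId`, while `c` starts a new session because more than 1800 seconds passed since the previous event.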
14. 14
EPLs (Context)
context MCContext
insert into TwitterTopCountryCount
select count(*) as count, country
from TwitterSample(country is not null)
group by country
output snapshot when terminated
order by count(*) desc
limit 10;

context MCContext
insert into TwitterTopLangCount
select count(*) as count, lang
from TwitterSample(lang is not null)
group by lang
output snapshot when terminated
order by count(*) desc
limit 10;

context MCContext
insert into TwitterTopHashTagCount
select topKNested(1000, 20, hashtag, ',') as TopHashTag
from TwitterSample(hashtag is not null)
output snapshot when terminated;

context MCContext
insert into TwitterEventCount
select count(*) as count
from TwitterSample
output snapshot when terminated;
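For readers unfamiliar with EPL, the country and language statements above count events per group within one context window and emit the ten largest groups when the window ends. The following Python sketch approximates that logic for a single, already-collected window. It is an analogy only: `top_n` and the sample events are hypothetical, and it does not reproduce Esper's context or output semantics.

```python
# Illustrative sketch only: top_n() is a hypothetical helper that mimics
# "group by ... order by count(*) desc limit n" for one window of events.
from collections import Counter


def top_n(events, field, n=10):
    """Count events by `field`, skipping events where it is missing or None,
    and return the n most frequent values with their counts."""
    counts = Counter(e[field] for e in events if e.get(field) is not None)
    return counts.most_common(n)


# One window's worth of sample events (made-up data).
window = [
    {"country": "US", "lang": "en"},
    {"country": "KR", "lang": "ko"},
    {"country": "US", "lang": "en"},
    {"country": None, "lang": "en"},
]

print(top_n(window, "country"))  # [('US', 2), ('KR', 1)]
print(top_n(window, "lang"))     # [('en', 3), ('ko', 1)]
print(len(window))               # 4, the analogue of TwitterEventCount
```

The `is not None` check plays the role of the `country is not null` filter in the EPL.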
15. 15
EPLs (Select)
@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopCountryCount;
@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopLangCount;
@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopHashTagCount;
@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterEventCount;
17. 17
3. Pulsar's key points
● Creating pipelines declaratively
● SQL driven processing logic with hot deployment of SQL
● Framework for custom SQL extensions
● Dynamic partitioning and flow control
● < 100 millisecond pipeline latency
● 99.99% Availability
● < 0.01% data loss
● Cloud deployable
18. 18
4. Other Stream Processing Frameworks
● Storm(Trident)
– Storm Transactional Topology
– Stateful
● Storm(Esper)
– Our solution, developed in the NexR project
– Integrates Esper
● Apache Spark
– Fast and general cluster computing platform for Big Data
– Supports SQL
19. 19
Storm(+Esper) / Spark vs Pulsar
Points                             | Pulsar   | Storm(Trident) | Storm(Esper) | Spark
Declarative pipeline wiring        | O        | X              | X            | X
Pipeline stitching                 | Run time | Build time     | Build time   | Build time
Hot deployment of topologies       | O        | X              | X            | X
SQL support                        | O        | X              | O            | O
Hot deployment of processing rules | O        | X              | O            | X
Pipeline flow control              | O        | △              | △            | ?
Stateful processing                | O        | O              | △            | O
(O: supported, X: not supported, △: partial, ?: unknown)
<http://gopulsar.io/docs/Pulsar_Presentation.pdf>