eBay Pulsar: Real-time analytics platform

Pulsar is an open-source, real-time analytics platform and stream processing framework. It can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid.

Published in: Data & Analytics

  1. eBay Pulsar (Real-time Analytics Platform) 2015.03.13 양경모
  2. Agenda
     1. What is Pulsar?
     2. Twitter stream processing demo
     3. Key points
     4. Other platforms
  3. 1. What is Pulsar?
     ● Developed by eBay
     ● Real-time analytics platform
     ● Stream processing framework
     ● Scalability – Scales to tens of millions of events per second
     ● Availability – No downtime during software upgrades, stream processing rule changes, or topology changes
     ● Flexibility – SQL-like language and annotations for defining stream processing rules
  4. Pulsar's Building Block (basic framework)
     ● Jetstream
       – Real-time stream processing framework
       – Spring IoC (Inversion of Control) container
  5. Pulsar's Building Block (Cont.) (basic framework)
  6. Pulsar's Building Block (Cont.) (basic framework)
     ● Jetstream's key points
       – CEP capabilities through Esper integration
       – Define processing logic in SQL
       – Extends SQL functionality and pipeline flow routing using SQL
       – Hot deploy SQL without restarting applications
       – Spring IoC enabling dynamic topology changes at runtime
       – Clustering with elastic scaling
       – Cloud deployment
  7. Pulsar Real-time Analytics Pipeline
     ● Collector: Ingests events through a REST endpoint
     ● Sessionizer: Sessionizes the events, maintaining the session state and generating marker events
     ● Distributor: Filters and mutates events for different consumers; acts as an event router
     ● Metrics Calculator: Calculates metrics by various dimensions and persists them in the metrics store
     ● Replay: Replays the failed events on other stages
     ● ConfigApp: Configures dynamic provisioning for the whole pipeline
  8. 1) Collector
     ● Supports REST API to ingest events
     ● Geo and device classification enrichment
     ● Detects fraud and bots
     ● Streams the enriched events to the Sessionizer stage
     Enrichment: PulsarRawEvent:A → PulsarEvent:A
       PulsarRawEvent:A  "si": "UUID", "ipv4": "ip", ..., "itmP": "itmPrice", "capQ": "campaignQuantity"
       PulsarEvent:A     "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
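The enrichment step on this slide can be sketched in a few lines of Python. This is only an illustration of the event shape shown above (a raw event wrapped together with `device` and `geo` fields), not Pulsar's actual Collector code. The `ua` field and the two lookup functions are hypothetical stand-ins for real device and geo classifiers.

```python
from typing import Callable, Dict

RawEvent = Dict[str, str]


def enrich(raw: RawEvent,
           device_lookup: Callable[[RawEvent], str],
           geo_lookup: Callable[[RawEvent], str]) -> dict:
    """Wrap a raw event with device and geo fields, keeping the raw payload intact."""
    return {
        "device": device_lookup(raw),
        "geo": geo_lookup(raw),
        "raw": raw,
    }


# Hypothetical classifiers, for illustration only.
def demo_device(raw: RawEvent) -> str:
    return "mobile" if "Mobile" in raw.get("ua", "") else "desktop"


def demo_geo(raw: RawEvent) -> str:
    return "private" if raw.get("ipv4", "").startswith("10.") else "public"


raw_event = {"si": "AAAAAA", "ipv4": "10.0.0.7", "ua": "Mobile Safari"}
event = enrich(raw_event, demo_device, demo_geo)
```

Keeping the untouched raw event under `raw` mirrors the slide's `PulsarEvent` layout and lets later stages read the original fields.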
  9. 2) Sessionizer
     ● Sessionization: temporal grouping of events that contain a specific identifier and occur within a time window referred to as the session duration
     ● Session metadata and state
     ● Session store (in-memory cache)
     Sessionization: PulsarEvent:A → PulsarEvent:A + Metadata:A
       PulsarEvent:A  "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
       Metadata:A     sessionId, PageId, geo-loc, device, etc.
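A minimal sketch of sessionization, assuming that a session ends once no event with the same identifier has arrived for longer than the session duration. The slide does not spell out the exact rule, so the inactivity-gap semantics, the `Session` fields, and the session-id format here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    session_id: str
    start_ts: float
    last_ts: float
    event_count: int = 0


class Sessionizer:
    """Groups events by identifier, keeping session state in an in-memory store."""

    def __init__(self, session_duration: float):
        self.session_duration = session_duration
        self.store: Dict[str, Session] = {}  # identifier -> open session
        self.ended: List[Session] = []       # closed sessions (stand-in for marker events)

    def on_event(self, si: str, ts: float) -> Session:
        current = self.store.get(si)
        # Close the open session if the gap since its last event is too long.
        if current is not None and ts - current.last_ts > self.session_duration:
            self.ended.append(current)
            current = None
        if current is None:
            current = Session(session_id=f"{si}-{int(ts)}", start_ts=ts, last_ts=ts)
            self.store[si] = current
        current.last_ts = ts
        current.event_count += 1
        return current


sessionizer = Sessionizer(session_duration=30 * 60)  # 30 minutes, in seconds
first = sessionizer.on_event("AAAAAA", ts=0)
second = sessionizer.on_event("AAAAAA", ts=600)      # same session
third = sessionizer.on_event("AAAAAA", ts=2401)      # gap of 1801 s starts a new session
```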
  10. 3) Distributor
      ● Event filtering, mutation, and routing
      EPL:
        @OutputTo("OutboundMessageChannel")
        @ClusterAffinityTag(colname="si")
        @PublishOn(topics="Pulsar.MC/ssnzEvent")
        select * from PulsarEvent;
      Flow: Outbound Message Channel → "Pulsar.MC/ssnzEvent" → Inbound Message Channel (×2)
        PulsarEvent:A  "si": "AAAAAA", "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
        PulsarEvent:B  "si": "BBBBBB", "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
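The `@ClusterAffinityTag(colname="si")` annotation suggests that events with the same `si` value are routed to the same downstream node. The sketch below illustrates that idea with plain hash-modulo routing. This is a simplification chosen for brevity and is not taken from the deck; the real framework may use a different scheme such as consistent hashing. The consumer names are hypothetical.

```python
import hashlib
from typing import List


def pick_consumer(event: dict, consumers: List[str], affinity_key: str = "si") -> str:
    """Route an event to a consumer chosen by a stable hash of its affinity key."""
    key = str(event[affinity_key]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(consumers)
    return consumers[index]


consumers = ["mc-node-1", "mc-node-2", "mc-node-3"]  # hypothetical node names
route_a1 = pick_consumer({"si": "AAAAAA"}, consumers)
route_a2 = pick_consumer({"si": "AAAAAA"}, consumers)
route_b = pick_consumer({"si": "BBBBBB"}, consumers)
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the routing decision identical across processes and restarts. Note that modulo routing remaps most keys when the consumer list changes size, which is the main reason production systems tend to prefer consistent hashing.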
  11. 4) Metrics Calculator
      ● Real-time metrics computation engine (Esper)
      ● Metrics are stored in Cassandra for batch processing
      EPL:
        context MCContext
        insert into PulsarEventCount
        select count(*) as count from PulsarEvent
        output snapshot when terminated;
        @BroadCast
        @OutputTo("OutboundMessageChannel")
        @PublishOn(topics="Pulsar.Report/metric")
        select * from PulsarEventCount;
      Flow: calculates PulsarEventCount:C ("count": 2), then Outbound Message Channel → "Pulsar.Report/metric" → Inbound Message Channel
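The counting statement above emits a snapshot of `count(*)` each time its context terminates. Assuming the context behaves like a fixed-length, non-overlapping time window (the definition of `MCContext` is not shown in the deck, so this is an assumption), the behavior can be sketched as a tumbling-window counter:

```python
from typing import List, Optional


class WindowedCounter:
    """Counts events per fixed window and emits a snapshot when a window closes."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.window_start: Optional[float] = None
        self.count = 0

    def on_event(self, ts: float) -> List[dict]:
        """Record one event; return snapshots for any windows that closed before it."""
        emitted = []
        if self.window_start is None:
            self.window_start = ts
        # Close every window that ended at or before this event's timestamp.
        # Windows with no events emit a snapshot with count 0.
        while ts >= self.window_start + self.window_seconds:
            emitted.append({"count": self.count})
            self.window_start += self.window_seconds
            self.count = 0
        self.count += 1
        return emitted


counter = WindowedCounter(window_seconds=10)
out_1 = counter.on_event(ts=0)
out_2 = counter.on_event(ts=5)
out_3 = counter.on_event(ts=12)  # closes the first window, which held two events
```

In this run, `out_3` carries a single snapshot with a count of 2, the same shape as the `PulsarEventCount` event on the slide.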
  12. 5) Replay
      ● At every stage, events are stored in Kafka
      ● Replays the failed events on other stages
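The store-then-replay idea can be sketched as follows. An in-memory deque stands in for Kafka here, and the `ReplayBuffer` class and the `downstream` function are hypothetical names for illustration; the sketch shows only the pattern, not how Pulsar's Replay stage is implemented.

```python
from collections import deque
from typing import Callable


class ReplayBuffer:
    """Holds events whose delivery failed so they can be replayed later."""

    def __init__(self):
        self.pending = deque()  # stand-in for a durable log such as Kafka

    def send(self, event: dict, deliver: Callable[[dict], None]) -> bool:
        """Try to deliver an event; keep it for replay if delivery fails."""
        try:
            deliver(event)
            return True
        except Exception:
            self.pending.append(event)
            return False

    def replay(self, deliver: Callable[[dict], None]) -> int:
        """Retry each pending event once; return how many were delivered."""
        replayed = 0
        for _ in range(len(self.pending)):
            event = self.pending.popleft()
            try:
                deliver(event)
                replayed += 1
            except Exception:
                self.pending.append(event)  # still failing, keep for the next pass
        return replayed


received = []
state = {"up": False}


def downstream(event: dict) -> None:
    if not state["up"]:
        raise ConnectionError("downstream unavailable")
    received.append(event)


buffer = ReplayBuffer()
sent_ok = buffer.send({"si": "AAAAAA"}, downstream)  # fails, event is stored
state["up"] = True
replayed_count = buffer.replay(downstream)           # succeeds on replay
```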
  13. 2. Demo (Twitter stream processing)
      Diagram labels: Twitter Stream, Twitter Stream Collector
  14. EPLs (Context)
        context MCContext
        insert into TwitterTopCountryCount
        select count(*) as count, country from TwitterSample(country is not null)
        group by country output snapshot when terminated order by count(*) desc limit 10;
        context MCContext
        insert into TwitterTopLangCount
        select count(*) as count, lang from TwitterSample(lang is not null)
        group by lang output snapshot when terminated order by count(*) desc limit 10;
        context MCContext
        insert into TwitterTopHashTagCount
        select topKNested(1000, 20, hashtag, ',') as TopHashTag from TwitterSample(hashtag is not null)
        output snapshot when terminated;
        context MCContext
        insert into TwitterEventCount
        select count(*) as count from TwitterSample
        output snapshot when terminated;
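For readers unfamiliar with EPL, the first two statements compute a filtered group-by count, sorted in descending order and limited to ten rows, over the events seen in one context window. A rough Python equivalent over an in-memory list of events might look like this; the sample tweets are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_counts(events: Iterable[dict], field: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Count events per value of `field`, skipping nulls, and return the top `limit`."""
    counts = Counter(e[field] for e in events if e.get(field) is not None)
    return counts.most_common(limit)


# Made-up sample events for one window.
tweets = [
    {"country": "KR", "lang": "ko"},
    {"country": "US", "lang": "en"},
    {"country": "KR", "lang": "en"},
    {"country": None, "lang": "en"},
    {"lang": "ja"},
]
top_countries = top_counts(tweets, "country")
top_langs = top_counts(tweets, "lang")
```

The `e.get(field) is not None` filter plays the role of `TwitterSample(country is not null)`, and `most_common(limit)` covers both `order by count(*) desc` and `limit 10`.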
  15. EPLs (Select)
        @BroadCast
        @OutputTo("OutboundMessageChannel")
        @PublishOn(topics="Pulsar.Report/metric")
        select * from TwitterTopCountryCount;
        @BroadCast
        @OutputTo("OutboundMessageChannel")
        @PublishOn(topics="Pulsar.Report/metric")
        select * from TwitterTopLangCount;
        @BroadCast
        @OutputTo("OutboundMessageChannel")
        @PublishOn(topics="Pulsar.Report/metric")
        select * from TwitterTopHashTagCount;
        @BroadCast
        @OutputTo("OutboundMessageChannel")
        @PublishOn(topics="Pulsar.Report/metric")
        select * from TwitterEventCount;
  16. Dashboard: http://<hostname>:8088
  17. 3. Pulsar's key points
      ● Creating pipelines declaratively
      ● SQL-driven processing logic with hot deployment of SQL
      ● Framework for custom SQL extensions
      ● Dynamic partitioning and flow control
      ● < 100 millisecond pipeline latency
      ● 99.99% availability
      ● < 0.01% data loss
      ● Cloud deployable
  18. 4. Other Stream Processing Frameworks
      ● Storm (Trident)
        – Storm Transactional Topology
        – Stateful
      ● Storm (Esper)
        – Our solution developed in the NexR Project
        – Integrates Esper
      ● Apache Spark
        – Fast and general cluster computing platform for Big Data
        – Supports SQL
  19. Storm(+Esper) / Spark vs Pulsar
      Points                              Pulsar    Storm(Trident)  Storm(Esper)  Spark
      Declarative pipeline wiring         O         X               X             X
      Pipeline stitching                  Run time  Build time      Build time    Build time
      Hot deployment of topologies        O         X               X             X
      SQL support                         O         X               O             O
      Hot deployment of processing rules  O         X               O             X
      Pipeline flow control               O         △               △             ?
      Stateful processing                 O         O               △             O
  20. References
      ● ng-pulsar-real-time-analytics-at- scale/#.VQIuqBCsVW2
  21. Q & A