Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid

4,094 views

Published on

Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid.
Presented at Strata + Hadoop World 2016

Published in: Technology
  • Be the first to comment

Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid

  1. 1. Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid September 2016 Tony Ng Director of Engineering
  2. 2. EBAY BUSINESS
  3. 3. *Q2 2016 data New Items 80% Fixed Price Items 86% Evolving from our Auction Roots.. Ships for Free 65%
  4. 4. EBAY AT A GLANCE $8.6B Revenue in 2015 $82B GMV in 2015 1B Live Listings * 164M Global Active Buyers * 190 Countries eBay apps are available in # *Q2 2016 data #Q4 2015 data 326M App downloads*
  5. 5. CHECKOUT WATCH LIST SHARE LOCAL PREFERENCES BID 10’s TB USER BEHAVIORAL DATA / DAY 10’s B DATABASE QUERIES / DAY 100’s PB DATA CLICK SEARCH
  6. 6. ENTERPRISE DATA PLATFORM Data Stores Data Streams & Processing Machine Learning Enterprise Data Ecosystem Data Ingestion Personalization / Optimization Insights / Reporting Kylin
  7. 7. Open-source real-time analytics and stream processing framework
  8. 8. Pulsar Stream • Focus on user behavioral data processing • Complex event processing – Streaming SQL with extensible annotations – Java • SQL for common stream operations (Filtering, mutation, aggregation) with time windows • Declarative topology construction • Each stage can adopt its own release and deployment cycles • Dynamic partitioning and flow control
  9. 9. Multi Stage Distributed Pipeline
  10. 10. Event Filtering and Routing Example // create filtered stream insert into FilteredStream select guid, evt_type, C1, C2, C3 from RawStream where evt_type = ‘bid’; // publish and route filtered stream @PublishOn(topics=“Topic1”) @Output(“OutboundChannel”) @ClusterAffinityTag(column = guid) select * from SubStream;
  11. 11. Aggregate Computation Example // create 10-second time window context create context MCContext start @now and pattern (timer:interval(10)]; // create aggreated stream within specified time window context MCContext insert into AggStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type output snapshot when terminated; // publish aggregated stream select * from AggStream;
  12. 12. Stream Aggregation Time Window Sliding Window Tumbling Window
  13. 13. TopN Computation Example // create 60-second time window context create context MCContext start @now and pattern (timer:interval(60)]; // create topN stream via sorting context MCContext insert into TopNStream select count(*) as M1, guid, evt_type from RawStream group by guid, evt_type order by M1 limit 10; // publish topN stream select * from TopNStream;
  14. 14. PULSAR BEHAVIORAL DATA PIPELINE
  15. 15. Pulsar Behavioral Data Pipeline Sessionizer Metrics Calculator Event Distributor Real Time Consumers Metrics Store Collector Real-time Pipeline BOT Detection Enriched Sessionized Events Producing Applications Real Time Dashboard and Services Kafka DruidHaddop / Kylin Batch Loader
  16. 16. Sessionization: Group together events of a single user visit e1 e2 e3 e4 e5 e7 e8 User 1: >30 min of inactivity Session A (User 1): e1, e2, e4 Session B (User 2): e3, e5, e6 Session C (User 1): e7, e8 e6 . . .
  17. 17. Sessionization Challenges • Session state management – High read/write throughput – State recovery when node crash/fail • Session Expiration – Full table scan is not acceptable
  18. 18. Sessionization Solution • Long live state management (At least 30 minutes) – Local Off-Heap Cache • Instantaneous Session Expiration (<= 1sec delay) – Double-Linked Off-Heap Map (Local Access) – Order by Expiration time (O(1)) • Pluggable Sessionization logic – SQL with customized annotation – Counter – State
  19. 19. Sessionizer Architecture Collector Sessionizer Sessionizer Local Off-Heap Session Cache I M C Timer Remote Store Client Remote Session Store Recovery O M C BotDetection Distributor Persist Sync
  20. 20. Bot Detection Overview • Detect non-human activities in near realtime • May treat bot traffic differently during analysis • High level bot rules – Self-declared bots by user agent – Behavior within a session or time window • Tag events with bot flag
  21. 21. Bot Detection SessionizerCollector BotDetection Distributor Behavior Events And Metrics BotSignature Bot Detection Service BotSignature Bot tagged Stream
  22. 22. PULSAR INTEGRATION WITH OTHER SYSTEMS Divider sub-headline goes here
  23. 23. Pulsar Integration with Kafka •Kafka – Persistent messaging queue – High availability, scalability and throughput •Pulsar leveraging Kafka – Supports pull and hybrid messaging model – Loading of data from real-time pipeline into Hadoop and other metric stores – Use schema to validate event payload 2
  24. 24. Messaging Models Producer Producer Queue Kafka Producer Queue Kafka Replayer Push Model Pull Model Pause/Resume Hybrid Model (At most once delivery semantics) (At least once delivery semantics) Consumer Consumer Consumer Netty
  25. 25. Pulsar Integration with Kylin •Apache Kylin – Distributed analytics engine – SQL interface and multi-dimensional analysis (OLAP) on Hadoop – Interactive Query on Billions of Rows •Pulsar leveraging Kylin – Build multi-dimensional OLAP cube over long time period – Aggregate/drill-down on dimensions such as browser, OS, device, geo location – Capture metrics such as session length, page views, event counts 2
  26. 26. Pulsar Integration with Druid •Druid – Real-time ROLAP engine for aggregation, drill-down and slice-n-dice •Pulsar leveraging Druid – Real-time analytics dashboard – Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds – Aggregate/drill-down on dimensions such as browser, OS, device, geo location 2
  27. 27. BEHAVIORAL DATA DRIVEN APPLICATIONS Divider sub-headline goes here
  28. 28. http://www.ebay.com/trending
  29. 29. TRENDING: ALGORITHMS NARROW THE FOCUS Algorithms and machine learning identify significant trends Humans provide the context and and interesting story
  30. 30. search view watch bid purchase w time NearlineOffline (historical) s vv…events… b Online (in-session) vpv s Activity Timeline for Personalization Customer Profile • Price • Category • Sale Type • Item Condition • Deals Customer Intent • Price • Category • Sale Type • Item Condition • Deals
  31. 31. EXAMPLE PERSONALIZED CONTENT Personalized digest for a consumer interested in jewelry and accessories Personalized digest for a consumer interested in auto and electronics
  32. 32. Behavior Data: A/B Testing 3 Behavioral Data: A/B Testing
  33. 33. More Information •GitHub: http://github.com/pulsarIO –repos: pipeline, framework, docker files •Website: http://gopulsar.io –Technical whitepaper –Getting started –Documentation •Google group: http://groups.google.com/d/forum/pulsar 3

×