High Volume Streaming Analytics
with CDAP
Jialong Wu
August 19, 2015
High Volume Streaming Analytics with CDAP: 08/19 Big Data Application Meetup, Talk #2

Speaker: Jialong Wu from Lotame
Big Data Applications Meetup, 08/19/2015
Palo Alto, CA

More info here: http://www.meetup.com/BigDataApps/

Link to talk: https://www.youtube.com/watch?v=GRtQFceQG6k


About the talk:
In this talk, we’ll present the design of our new data stream processing application at Lotame and describe how we achieved a significant reduction in cluster resource utilization while enabling faster updates of client audience data and better ad-hoc query support on the new platform.

We will examine the challenges faced in counting uniques in a high-volume stream processing environment and present a novel approach using time-windowed HyperLogLog aggregates. We’ll also discuss how CDAP enabled us to roll out this new platform quickly, and share some valuable lessons and best practices we learned during the development cycle.

LOTAME DMP
COLLECT
• Web Data
• Mobile Data
• Set-Top Data
• CRM Data
ORGANIZE
• Data into default and custom hierarchies
• Easily managed in folder structure
ACTIVATE
• Deliver targeted ad campaigns and marketing promotions
• Dynamically serve content based on audience segment
• Generate advanced audience analytics
Client Dashboard
Counting Uniques
Number of unique site visitors (profiles) that exhibit certain behavior(s) during some time period
• Useful for estimating audience reach, tracking behavior trends, etc.
• Report stats in daily, month-to-date, and 30-day intervals
• Roll up by client-network and behavior-category hierarchies
• sum(Uday1, …, Uday30) != U30day
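The last point on this slide is worth making concrete: per-day unique counts cannot simply be summed to get a 30-day unique count, because the same profile can appear on several days. A toy Python sketch (illustrative data, not Lotame code):

```python
# Toy illustration: summing per-day unique counts double-counts
# profiles that are seen on more than one day.
day1 = {"profileA", "profileB", "profileC"}
day2 = {"profileB", "profileC", "profileD"}

sum_of_dailies = len(day1) + len(day2)  # 6: B and C counted twice
true_uniques = len(day1 | day2)         # 4: the set union deduplicates

print(sum_of_dailies, true_uniques)  # 6 4
```

This is why the pipeline keeps mergeable sketches (HLLs) rather than per-day counts: unions can be taken on demand over any time window.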
Counting Uniques
Old Approach
• MapReduce-based batch processing
• Re-scans historical profile data (expensive)
• Half-day wait for the previous day’s stats
• Fixed time buckets for counting
New Approach
• Real-time event stream processing with CDAP
• Uses HyperLogLog for estimating unique counts
• No re-scanning of historical profile data
• Flexible, on-demand aggregation and deduplication
• Allows frequent stats updates
HyperLogLog
Estimates distinct values (DV) in a set from the length of the longest run of trailing zeros among all the hash values in the set
Improves the estimate by averaging results from multiple estimators
• Splits input into 2^p bins, where p is the number of register index bits
• Takes the harmonic mean of estimates to filter extreme outliers
Set unions are lossless
• Allows distributed computation of the DV count
Estimating set intersections: accuracy is low for sets that have a high cardinality ratio or low overlap
Space savings: 1% error rate for a 1B count using 1.5 KB of memory
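A minimal Python sketch of the idea, assuming p = 14 as on the next slide. This is illustrative only, not the implementation from the talk (which runs on the JVM); the hash choice is arbitrary, and the small- and large-range bias corrections of the full algorithm are omitted:

```python
import hashlib

# Minimal HyperLogLog sketch. p register index bits -> 2**p bins.
P = 14                             # the talk uses p = 14 (~16K bins)
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # bias-correction constant for large M

def _hash64(value: str) -> int:
    """Uniform 64-bit hash of the input value (arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def add(registers, value):
    h = _hash64(value)
    bin_idx = h & (M - 1)          # low p bits pick the register (bin)
    rest = h >> P                  # remaining bits feed the estimator
    # Rank = number of trailing zeros + 1; each register keeps the max rank.
    rank = (rest & -rest).bit_length() if rest else (64 - P) + 1
    registers[bin_idx] = max(registers[bin_idx], rank)

def estimate(registers) -> float:
    # Harmonic mean across bins damps extreme outliers.
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

registers = [0] * M
for i in range(100_000):
    add(registers, f"profile-{i}")

est = estimate(registers)
print(round(est))  # typically within a few percent of 100000
```

Because each register only keeps a maximum, merging two sketches is an element-wise `max`, which is the lossless set union the slide refers to.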
HLL Accuracy
• Uses 14 register index bits (~16K bins)
• Unbiased error (mean error is 0%)
• Standard deviation of 0.8%
• Accurate 99.9% of the time with < 2.5% error
• Exact count for low cardinality
• Stores hash values in Trove sets
Why CDAP?
Good API abstraction layer for building data processing applications
• Faster time-to-market
• Lower entry barrier for developing big data applications
• Better reusability for data and processing patterns
Support for both stream and batch processing paradigms
• Shares data across programs in different paradigms
“Exactly once” transactional processing
Good support for the application development cycle
• No distributed components required for development
• CDAP modes: In-Memory, Standalone, Distributed
Architecture
CDAP Flow
• Extends CDAP Kafka Flowlet Library
• Each Flowlet instance pulls from one topic-partition
• Configurable starting point for reading data
CDAP Flow
• Encapsulates attribute-counting logic in pluggable processors
• Attribute tuple = attribute values + profile ID hash
• Each processor emits to its own Flowlet queue
• Emits processed attribute tuples with the hash of the attribute value as the partition key
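The last bullet can be sketched as follows; the function name, hash choice, and partition count are hypothetical stand-ins, not CDAP APIs:

```python
import zlib

# Sketch of hash-based routing: a stable hash of the attribute value
# picks the partition, so every tuple for a given attribute value lands
# on the same downstream Flowlet instance.
N_PARTITIONS = 4  # hypothetical number of downstream Flowlet instances

def partition_for(attribute_value: str) -> int:
    # Keeping all updates for one attribute on one instance means its
    # in-memory HLL object is never split across instances.
    return zlib.crc32(attribute_value.encode()) % N_PARTITIONS

print(partition_for("behavior:sports"))
```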
CDAP Flow
• Updates in-memory HLL objects for each incoming attribute tuple
• Flushes in-memory HLL objects to Datasets every minute
• Uses a hash-based partition strategy for better scalability
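The buffer-and-flush cycle might look like the sketch below. `HllBuffer` and `DictStore` are hypothetical stand-ins (a real Flowlet writes through CDAP Datasets), and plain sets play the role of HLL sketches here, since set union is lossless just like an HLL merge:

```python
import time
from collections import defaultdict

FLUSH_INTERVAL = 60.0  # seconds; the slide flushes every minute

class DictStore:
    """Stand-in for a persistent Dataset keyed by attribute tuple."""
    def __init__(self):
        self.data = defaultdict(set)

    def merge(self, key, sketch):
        self.data[key] |= sketch  # union, like merging two HLLs

class HllBuffer:
    """Buffers per-tuple sketches in memory, flushing once per interval."""
    def __init__(self, store, interval=FLUSH_INTERVAL):
        self.store = store
        self.interval = interval
        self.buffer = defaultdict(set)   # attribute tuple -> in-memory sketch
        self.last_flush = time.monotonic()

    def update(self, key, profile_hash):
        self.buffer[key].add(profile_hash)
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        for key, sketch in self.buffer.items():
            self.store.merge(key, sketch)
        self.buffer.clear()
        self.last_flush = time.monotonic()

store = DictStore()
buf = HllBuffer(store)
buf.update(("client1", "behavior", "sports"), 12345)
buf.flush()
print(len(store.data[("client1", "behavior", "sports")]))  # 1
```

Batching writes this way trades a minute of staleness for far fewer Dataset writes per attribute tuple.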
Dataset Row Key
Encodes the attribute tuple in a byte array
Row key components:
• Prefix byte
• Hour ID (ddHH)
• Client ID
• Attribute Type
• Attribute
• Event timestamp (yyyyMMddHHmm)
Prefix byte: hash(client_id + attribute type + attribute) mod (# of buckets)
Optimized for writes and specific scan patterns
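A sketch of this key layout in Python. The component order follows the slide, but the field widths, delimiters, and bucket count are assumptions for illustration:

```python
import struct
import zlib
from datetime import datetime

N_BUCKETS = 16  # hypothetical number of pre-split salt buckets

def row_key(client_id: int, attr_type: str, attr: str,
            event_ts: datetime) -> bytes:
    # Prefix byte = hash(client_id + attribute type + attribute) mod buckets,
    # salting the key so writes spread across pre-split regions instead of
    # hotspotting on the current hour.
    salt = zlib.crc32(f"{client_id}{attr_type}{attr}".encode()) % N_BUCKETS
    hour_id = event_ts.strftime("%d%H").encode()             # ddHH
    event_minute = event_ts.strftime("%Y%m%d%H%M").encode()  # yyyyMMddHHmm
    return (bytes([salt])
            + hour_id
            + struct.pack(">I", client_id)    # fixed-width client ID (assumed)
            + attr_type.encode() + b"\x00"    # null-delimited variable fields
            + attr.encode() + b"\x00"
            + event_minute)

key = row_key(42, "behavior", "sports", datetime(2015, 8, 19, 10, 30))
print(key[0], key[1:5])  # salt bucket, then hour ID b"1910"
```

Putting the hour ID before the client and attribute fields keeps one hour's updates for a tuple contiguous, which suits the hourly scan patterns the slide mentions.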
Best Practices
Datasets
• Pre-split regions to distribute write load
• Use a proper key format to avoid hotspots
• Keep key length short
• Use the lowest conflict detection level possible
Flow Queues
• Use a hash-based partitioning strategy for queues
• Minimize payload size between Flowlets
• Pick an appropriate batch size for processing (between 100 and 500)
• Balance transaction duration and size of work in Flowlets
• Monitor the Flowlet pending-events metric closely
Gains
• Uniques stats are available hours earlier
• Consumes fewer computing resources
• Ability to (re-)process uniques stats retrospectively and efficiently (e.g., for bug fixes and new features)
• Enables new products that utilize real-time feedback
Thank You!
jwu@lotame.com
http://www.lotame.com
