Austin Scales- Clickstream Analytics at Bazaarvoice

Evan Pollan talks about Bazaarvoice's Hadoop infrastructure for clickstream analytics, as well as an approach to large-scale cardinality analysis using Map/Reduce and HBase.

Published in: Technology

  1. Clickstream Analytics at Bazaarvoice
     Evan Pollan, Engineering Lead, @EvanPollan
  2. Agenda
     • Infrastructure: lessons learned operating Hadoop in EC2
     • Case study: uniques at scale using Hadoop and HBase
     Confidential and Proprietary. © 2012 Bazaarvoice, Inc.
  3. Project Magpie
     • Bazaarvoice products – extremely large web surface area
     • Client-side instrumentation to measure interactions
     • Many event sources (apps) => one sink: Magpie
     • Consolidated HTTP event collection
       – Network-wide event correlation
       – Network ~ many apps and many “sites” (clients)
     • Clickstream == topically segmented JSON event log files
     • Sense of scale
       – 10 – 20K events per second
       – 500M – 1B impressions per day
       – 25 – 50 GB compressed event log data per day
  4. Infrastructure Whys
     • Why Hadoop?
       – Experience scaling brute-force log processing via Hadoop
         • Everybody’s favorite: Akamai edge request logs
         • EMR, Apache Whirr
       – Needed online analytics – HBase fit the bill
       – Apache OSS ecosystem familiar to BV
     • Why Amazon Web Services?
       – Existing infrastructure hosting solution too inflexible and slow
       – Couldn’t scale R&D without an elastic infrastructure
  5. High-level architecture
     • Event collectors in auto-scale groups behind elastic load balancers
     • Event stream compressed and uploaded hourly to S3
     • S3: store of record
     • Hadoop cluster:
       – HDFS: stores raw event logs, derived file-based data sets, and HBase HFiles/WALs
       – Oozie: job scheduling, data dependency management
       – MapReduce: analytics (mix of Pig, Java => 100% Java)
       – HBase: stores hourly/daily analytics results
     • Job Portal: job schedule viz, gap analysis & alerting
     • UI/API: analytics available via JSON API and in Backbone.js UI
  6. EMR vs. roll our own
     • Neither
     • Cons: EMR
       – Price premium
       – Opaque Hadoop configuration
       – No way to mitigate SPOFs
     • Cons: Roll our own
       – Small group of engineers, no ops manpower at beginning
     • Solution: Cloudera
       – Cloudera Manager for config management and provisioning
       – CDH 3.X distribution
  7. Missteps
     • Problem: non-HA NameNode
     • Solution: EBS!
     • Problem: EC2 MTBF iffy
     • Solution: EBS!
     • Reality: When something goes wrong in AWS, it is invariably an outage or degradation in EBS.
       – Violates the whole concept of data locality. Hadoop + SAN = sadness
     • Problem: Where should HBase live?
     • Solution: Co-resident with MapReduce!
  8. Where we’ve ended up
     • Moved to the latest Cloudera CDH 4.X
       – HA NameNode!
       – Zookeeper for leader election
       – Quorum Journal Manager for edit logs
     • Learn to let go
       – Mitigate SPOF where possible, but plan for failure
       – End-to-end automation for DR/migration
     • Avoid EBS like the plague
     • HBase and MapReduce segmentation
       – Enables different hardware step size
       – Batch processing doesn’t affect HBase response time
       – Better understanding of HBase/HDFS locality (or lack thereof)
  9. Let’s talk sets
  10. Let’s talk sets
      • Common problem: uniques (e.g. unique visitors, users, etc.)
      • Naïve solution: SELECT DISTINCT(X) FROM Y
      • Not tenable given:
        – Massive, semi-structured data set
        – Thousands of grouping axes
      • OK: pre-calculate via MapReduce
      • But…
        – What would you pre-calculate?
        – Daily for each grouping?
        – How would you answer queries for other time ranges? Pre-calculate them, too?
  11. Set Unions
      • Definition: cardinality of a set is the number of elements in that set
        – A = {1, 2, 3}; |A| = 3
      • Cardinality of the union of two sets cannot be determined from the cardinality of the two sets
        – |A ∪ B| not necessarily equal to |A| + |B|
        – Only equal if A and B are disjoint
        – How do you know if they’re disjoint?
        – You need both sets
      • Imagine:
        – Set “a” is the visitors from yesterday
        – Set “b” is the visitors from today
        – To get uniques for both days, you have to look at both data sets
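The union point on slide 11 can be illustrated with a minimal Python sketch. The visitor IDs are invented for illustration and are not data from the talk.

```python
# Hypothetical visitor IDs (assumption, not from the talk).
yesterday = {"v1", "v2", "v3"}
today = {"v2", "v3", "v4", "v5"}

# Adding the two daily counts double-counts returning visitors.
sum_of_counts = len(yesterday) + len(today)

# The true number of uniques needs the members of both sets.
true_uniques = len(yesterday | today)

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
overlap = len(yesterday & today)
assert true_uniques == sum_of_counts - overlap

print(sum_of_counts, true_uniques, overlap)
```

The two daily counts alone (3 and 4) cannot tell you the overlap, which is why stored per-day cardinalities are not enough to answer multi-day queries.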
  12. An entirely different set: bit sets
      • Translate set members’ identifiers to an index in a bit set
      • Bit sets are combinable – yahtzee!
      • HBase is good at storing bits
        – MapReduce to build bit set for each grouping in your smallest desirable unit of time
        – Persist w/ row key as a function of date and grouping
      • Uniques for last month?
        – Scan: start and stop rows accounting for date range and grouping
        – Merge each day’s bit set with a single bit set representing the union
        – Count the number of “on” bits in the merged bit set => cardinality
      • But…
        – # bits for items whose identifiers number in the billions?
        – A billion bits is a lot of bits
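As a rough sketch of the bit set idea on slide 12, the Python below uses an arbitrary-precision integer as the bit set. The helper names `to_bitset` and `cardinality` and the index values are hypothetical, chosen only for illustration.

```python
def to_bitset(indexes):
    """Build a bit set (as a Python int) with one bit set per member index."""
    bits = 0
    for i in indexes:
        bits |= 1 << i
    return bits


def cardinality(bits):
    """Count the number of "on" bits."""
    return bin(bits).count("1")


# Hypothetical per-day member indexes (assumption, not from the talk).
day1 = to_bitset([0, 3, 7])
day2 = to_bitset([3, 7, 9])

# Union of two days is a bitwise OR; shared members collapse automatically.
merged = day1 | day2
print(cardinality(day1), cardinality(day2), cardinality(merged))
```

Because OR is associative and commutative, any number of daily bit sets can be folded together in any order, which is what makes the structure combinable.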
  13. Bit sets – solving the size problem
      • 10^9 bits is an expensive way to store a combinable cardinality
      • Query I/O example: Uniques for last quarter
        – 120 MB/day * 90 days = 10.8 GB
        – Too much to pull out of HBase to answer an “online” query
      • Storage example: 10K different grouping axes
        – Clients, sites, favorite colors, whatever
        – 120 MB * 10K = 1.2 TB/day of storage
      • Possible mitigation: compression
        – Still need to generate a 120 MB data structure, then compress
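The arithmetic on slide 13 can be checked with a few lines of Python. The slide's 120 MB figure is presumably a rounding of the size of a 10^9-bit set (125,000,000 bytes, about 119 MiB); that reading is an assumption, and the variable names are illustrative.

```python
BITS = 10**9

# One bit per possible identifier, 8 bits per byte.
bytes_per_set = BITS // 8
mib_per_set = bytes_per_set / 2**20

# The slide's rounded per-day figure, in MB.
MB_PER_DAY = 120

# Query I/O: one bit set per day for a quarter.
quarter_gb = MB_PER_DAY * 90 / 1000

# Storage: one bit set per grouping axis per day.
storage_tb_per_day = MB_PER_DAY * 10_000 / 1_000_000

print(bytes_per_set, round(mib_per_set, 1), quarter_gb, storage_tb_per_day)
```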
  14. Cardinality Estimation
      • Many different approaches to estimate the cardinality of a set
        – General goal: calculate cardinality in small RAM footprint
      • Big breakthrough in 2007: the HyperLogLog estimator
      • What’s the big deal?
        – Tunable accuracy
        – Incredible information density
        – Combinable
      • Analog: lossy compression of bit sets
      • How good?
        – Estimate cardinality of 10^9 unique elements +/-2% in 1.5 KB
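To show what "tunable" and "combinable" mean in practice, here is a toy HyperLogLog in Python. It is a teaching sketch, not the stream-lib implementation used in the talk: the SHA-1 based hash, the default of 2^11 registers, and the class layout are all assumptions. With m registers the standard error is roughly 1.04/sqrt(m); for m = 2048 that is about 2.3%, and 2048 six-bit registers would occupy 1.5 KB, which is in line with the slide's figure. This sketch spends a full byte per register for simplicity.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not stream-lib)."""

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits select a register
        rest = h & ((1 << width) - 1)       # remaining bits feed the rank
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Fold another estimator in: register-wise maximum."""
        assert self.m == other.m
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping "days" of hypothetical visitor IDs, 100,000 distinct in total.
day1, day2, both = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(60_000):
    day1.add(f"user-{i}")
    both.add(f"user-{i}")
for i in range(40_000, 100_000):
    day2.add(f"user-{i}")
    both.add(f"user-{i}")

day1.merge(day2)
estimate = day1.count()
print(round(estimate))
```

Merging the two daily estimators produces exactly the same registers as a single estimator fed every ID, which is the property the rest of the talk relies on.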
  15. Nuts & Bolts
      • http://github.com/clearspring/stream-lib
        – Java impls of top-K, frequency, and cardinality for streams
      • Aha moment: combining estimators from distributed counters is no different than combining them across different time periods!
      • MapReduce algorithm
        – map(Event) : (key, identifier)
          • key is whatever grouping you want uniques for
        – Shuffle sorts all (key, identifier) tuples by key
        – reduce(key, Iterable<identifier>) : estimator bytes
          • Reducer simply updates the estimator in-place – tiny RAM footprint
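The map/shuffle/reduce flow on slide 15 can be simulated in-process with plain Python. This is only a sketch of the data flow: the event fields are hypothetical, and an exact `set` stands in for the HyperLogLog estimator that the real reducer would update and serialize to bytes.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical events (assumption, not data from the talk).
events = [
    {"brand": "brandX", "visitor": "v1"},
    {"brand": "brandY", "visitor": "v3"},
    {"brand": "brandX", "visitor": "v2"},
    {"brand": "brandX", "visitor": "v1"},
]


def map_event(event):
    """map(Event) : (key, identifier); the key is the grouping axis."""
    return (event["brand"], event["visitor"])


def reduce_key(key, identifiers):
    """reduce(key, Iterable<identifier>) : estimator.

    A set stands in for the estimator here; the reducer only ever
    holds one estimator per key and updates it in place.
    """
    estimator = set()
    for identifier in identifiers:
        estimator.add(identifier)
    return key, estimator


# Shuffle: sort all (key, identifier) tuples by key so each key is contiguous.
pairs = sorted(map(map_event, events), key=itemgetter(0))

results = dict(
    reduce_key(key, (identifier for _, identifier in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print({key: len(est) for key, est in results.items()})
```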
  16. Nuts & Bolts
      • Reducer output: HBase Put
      • HBase “schema”, e.g. daily uniques aggregated by brand:

          Row Key            Estimator
          brandX-20130101    [0100110111000]
          brandX-20130102    [0110100111000]
          brandX-20130103    [0100000101011]
          brandY-20130101    [0101100011000]
          brandY-20130102    [0100100111001]

      • Scan: brandX, Jan 2-3

          [0110100111000]    (brandX-20130102)
          [0100000101011]    (brandX-20130103)
          [0110100111011]    (merged)
          Cardinality = N
  17. Nuts & Bolts
      • HBase scan is the key to making this fast
        – First result: instantiate HyperLogLog estimator
        – Remaining results: update estimator in-place
      • O(n) to compute result, n ~ number of bits in estimator (1.5KB)
      • Freedom to build a data set of unique estimators that can be arbitrarily sliced quickly
        – Quarterly, daily, weekly, ad-hoc date ranges
        – HBase client pulls 1.5KB * number of days, returns a long
        – Perf anecdote: REST API call to get network-wide uniques for current month-to-date
          • 66 ms over the internet
          • 12 ms server-side latency
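The scan-and-merge step from slides 16 and 17 can be sketched in Python using the bit patterns shown on slide 16. A sorted in-memory dict stands in for the HBase table, and `scan` and `merged_estimator` are hypothetical helpers, not HBase client calls. The range is start-inclusive and stop-exclusive, matching HBase's default scan semantics. The slide draws estimators as bit strings, so the merge here is a bitwise OR; a real HyperLogLog merge would take the register-wise maximum instead.

```python
from bisect import bisect_left

# Rows copied from the slide 16 example; values are the illustrative bit patterns.
rows = {
    "brandX-20130101": 0b0100110111000,
    "brandX-20130102": 0b0110100111000,
    "brandX-20130103": 0b0100000101011,
    "brandY-20130101": 0b0101100011000,
    "brandY-20130102": 0b0100100111001,
}
sorted_keys = sorted(rows)


def scan(start_row, stop_row):
    """Yield (key, value) for start_row <= key < stop_row, in key order."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    for key in sorted_keys[lo:hi]:
        yield key, rows[key]


def merged_estimator(start_row, stop_row):
    """First result seeds the estimator; remaining results are folded in."""
    merged = None
    for _, value in scan(start_row, stop_row):
        merged = value if merged is None else merged | value
    return merged


# brandX, Jan 2-3: the stop row is the first key past the range.
jan_2_3 = merged_estimator("brandX-20130102", "brandX-20130104")
print(format(jan_2_3, "013b"))
```

Because the row key starts with the grouping and ends with the date, any contiguous date range for one grouping is a single short scan, and the client only ever holds one estimator in memory.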