• Save
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural Case Study

Like this? Share it with your network

Share

HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural Case Study

  • 3,307 views
Uploaded on

Presented by: Murtaza Doctor (RichRelevance) and Giang Nguyen (RichRelevance)

Presented by: Murtaza Doctor (RichRelevance) and Giang Nguyen (RichRelevance)

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
3,307
On Slideshare
3,300
From Embeds
7
Number of Embeds
2

Actions

Shares
Downloads
1
Comments
0
Likes
9

Embeds 7

https://twitter.com 4
http://localhost 3

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • How we plan to go over stuff
  • The RichRelevance DataMesh Cloud Platform delivers a single view of your customer by:Giving you one single place to house unlimited sets of dataExample use cases:Create your own run-time strategies (predictive models)Create and manage segments via toolAutomatic & real-time segment creationView performance of strategies against KPIs Run adhoc queries using SQL-like toolImport into offline toolsOLAP capabilitiesMarket Basket AnalysisCustomer Lifetime ValueSequential Pattern miningManage APIs, build products & applications
  • Nuggets or Data Points1.5PB not as big as yahoo or facebook – huge from a retail industry perspective
  • Distributed System:: i.e. producers, brokers and consumer entities can all be deployed to different hosts in different colos in a truly distributed fashion and coordination controlled through zookeeperPersistence of Messages: messages need to be persisted on the broker for reliability, replay and temporary storagePush & Pull Mechanism:: i.e. push data to Kafka server and pull data from it using a consumer. This allows for two different rates: rate at which messages are transferred to the kafka server and the rate at which the messages are consumed.: Kafka supports GZIP and version 0.8 will additionally support Snappy compression.

Transcript

  • 1. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Murtaza Doctor Principal Architect Giang Nguyen Sr. Software Engineer How we solved Real-Time User Segmentation using Hbase
  • 2. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Outline • {rr} Story • {rr} Personalization Platform • User Segmentation: Problem Statement • Design & Architecture • Performance Metrics • Q&A
  • 3. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Personalized Product Recommendations • Targeted Promotional Content • Relevant Brand Advertising RichRelevance delivers:
  • 4. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Multiple Personalization Placements Personalized Product Recommendations Targeted Promotions Relevant Brand Advertising Inventory and Margin Data Catalog Attribute Data Real time user behavior Multi-Channel purchase history 3rd Party Data Sources Future Data Sources Input Data 100+ algorithms dynamically targeted by page, context, and user behavior, across all retail channels. Targeted content optimization to customer segments Monetization through relevant brand advertising
  • 5. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.  One place where all data about your customer lives (e.g. loyalty, customer service, offline, catalogue)  Direct access to data for your entire enterprise via API  Real-time actionable data and business intelligence  Link customer activity across any channel  Leverage 3rd party data (e.g. weather, geography, census) Delivering a Customer-centric Omni-channel Data model RichRelevanceDataMesh Real-time Segmentation DMP Integration RichRelevance DataMesh Cloud Platform Delivering a Single View of your Customer Event-based Triggers Ad Hoc Reporting Omni-channel Personalization Loyalty Integration
  • 6. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Our cloud-based platform supports both real-time processes and analytical use cases, utilizing technologies to name a few: Crunch, Hive, HBase, Avro, Azkaban, Voldemort, Kafka Someone clicks on a {rr} recommendation every 21 milliseconds Did You Know? Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real-time In the US, we serve 7000 requests per second with an average response time of 50 ms
  • 7. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. User Segmentation Finding and Targeting the Right Audience Example segment : savvy shopper • Demographics – Female – Age 30 to 25 • DMA – 98004 • Behavior – Purchased a Gucci hand bag last month, Louis Vuitton hand bag 2 weeks back, Chanel hand bag last week – Viewed products in the Cartier brand 2 days ago
  • 8. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Utilizes valuable targeting data such as event logs, off-line data, DMA, demographics, and etc. • Finds highly qualified consumers and buckets them into segments – Example: for Retail, Segment Builder supports view, click, purchase on products, categories, and brands What is the {rr} Segment Builder?
  • 9. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Segment Builder Segment’s list page
  • 10. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Segment Builder Add/Edit segment page
  • 11. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Create segments to capture the audience via web UI – Segment definition consists of one or more behaviors joined together using logical expressions – Example: Find all users who viewed a category in the last 7 days at least 3 times, or clicked on a product at least 2 times in the last 2 days, and from zip code 90210 • Each behavior is captured by a rule • Each rule corresponds to a row key in HBase • Each rule returns the users • Rules are joined using set union or intersection • Segment definition consists of one or more rules Design: Segment Evaluation Engine
  • 12. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. 1. Segment Builder Add rule 2. 3.
  • 13. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Segment Builder Add additional rule 1. 3. 2. 4.
  • 14. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • User interacts with a retail site • Events are triggered in real-time, each event is converted into an Avro record • Events are ingested using our real-time framework built using Apache Kafka Let’s start with real-time data ingestion
  • 15. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Design Principles: Real-Time Solution • Streaming: Support for streaming versus batch • Reliability: no loss of events and at least once delivery of messages • Scalable: add more brokers, add more consumers, partition data • Distributed system with central coordinator like zookeeper • Persistence of messages • Push & pull mechanism • Support for compression • Low operational complexity & easy to monitor
  • 16. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Our Decision: Apache Kafka • Distributed pub-sub messaging system • More generic concept (Ingest & Publish) • Can support both offline & online use- cases • Designed for persistent messages as the common case • Guarantee ordering of events • Supports gzip & snappy (0.8) compression protocols • Supports rewind offset & re-consumption of data • No concept of master node, brokers are peers
  • 17. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
  • 18. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Common Consumer Framework
  • 19. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Volume Facts? • Daily clickstream event data ~ 150 GB • Average size of message ~ 2 KB • Batch Size ~ 5000 messages • Producer throughput: 5000 messages/sec • Real time Hbase Consumer: 7000 messages/sec on average
  • 20. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. End-to-end Real-time Story… • User exhibits a behavior • Events are generated at front-end data-center • Events are streamed to backend data center via Kafka • Events are consumed and indexed in HBase tables • Takes seconds from event origination to indexing in HBase
  • 21. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. site.com [retailer] runtime [websever] clickstream events clicks views purchases kafka Producer kafka Broker runtime colo backend colo mirrored Kafka Broker HBase Consumer HBase Segment Manager Segment History Segment Evaluator avro events dimension tables events available in msecs shopper End-to-end Real-time Story…
  • 22. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Row key represents the behavior • Columns store the user id • Cell stores the time when user exhibited behavior. Used to capture frequency • One column family U Design: HBase Table Rowkey Columns 338VBChanel1370377143 23b93laddf82:1370377143973 Hd92jslahd0a:1313323414192 338CCElectronic1370377143 z3be3la2dfa2:1370477142970 kd9zjsla3d01:1313323414192
  • 23. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Row key contains an attribute value, siteId, metric, and attribute, followed by the timestamp that this occured, and finally the userGUID [salt]_len_[siteId]_len_[metric]_len_[dimension]_len_[value] [timestamp] • _len_ is the integer length of the field following it. Fixed- length fields (timestamp) don't get a length • Timestamp is stored in day granularity • [salt] is computed by first creating a hash from the siteId, metric, and dimension, then combining this with a random number between 0 and N (number of shards) for sharding Design: User Index
  • 24. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Each row is sharded to prevent hot-spotting • Popular products or high level categories can have millions of users, each day • Each row key is sharded to keep the row small • Shard into N number of regions (predefined) • Calculate the hash with a random number between 0 and N; we then spread this load to N regions Design: Row Sharding
  • 25. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Each row key returns a set of users • OR = Full outer join • AND = Inner join • Done in memory for small rules • Merged on disk for large rules Design: Joins
  • 26. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Segmentation Performance • Data indexed in real time – User browses site and clicks product and is added to a segment in less than a second • 40K puts/sec over 2 tables, 8 regions per table • Scaling achieved through addition of regions • Segments with 4-5 attributes/dimensions can be evaluated in msecs • Large segments with millions of users can be evaluated in secs
  • 27. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. • Real-time data ingestion an important ingredient • Indexing of segmentation attributes • HBase key design critical for performance • Distributed data centers complicates matters Lessons Learned
  • 28. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
  • 29. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Thank You!