Causata HBase Deployment Presentation


Causata HBase Presentation: Mixing Low Latency with Analytical Workloads for Customer Experience Management

Published in: Technology

  1. Mixing low latency with analytical workloads for Customer Experience Management
     June 13, 2013
     Neil Ferguson
  2. Causata Overview (www.causata.com)
     • Real-time Offer Management
       – Involves predicting something about a customer based on their profile
       – For example, predicting if somebody is a high-value customer when deciding whether to offer them a discount
       – Typically involves low latency (< 50 ms) access to an individual profile
       – Both on-premise and hosted
     • Analytics
       – Involves getting a large set of profiles matching certain criteria
       – For example, finding all of the people who have spent more than $100 in the last month
       – Involves streaming access to large amounts of data (typically millions of rows / sec per node)
       – Often ad-hoc
  3. Some History
     • Started building our platform 4 ½ years ago
     • Started on MySQL
       – Latency too high when reading large profiles
       – Write throughput too low with large data sets
     • Built our own custom-built data store
       – Performed well (it was built for our specific needs)
       – Non-standard; maintenance costs
     • Moved to HBase last year
       – Industry standard; lowered maintenance costs
       – Can perform well!
  4. Our Data
     • All data is stored as Events, each of which has the following:
       – A type (for example, “Product Purchase”)
       – A timestamp
       – An identifier (who the event belongs to)
       – A set of attributes, each of which has a type and value(s), for example:
         • “Product Price” -> 99.99
         • “Product Category” -> “Shoes”, “Footwear”
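The event structure on this slide can be sketched as a small data class. This is only an illustration of the four parts the slide lists; the class name, field names, and the `customer-42` identifier are assumptions, not Causata's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Union

AttributeValue = Union[str, float, int]


@dataclass(frozen=True)
class Event:
    """One raw event with the four parts listed on the slide (names are hypothetical)."""
    event_type: str        # e.g. "Product Purchase"
    timestamp: datetime    # when the event happened
    identifier: str        # who the event belongs to
    # attribute type -> one or more values
    attributes: Dict[str, List[AttributeValue]] = field(default_factory=dict)


# The example attributes from the slide, attached to a made-up customer.
purchase = Event(
    event_type="Product Purchase",
    timestamp=datetime(2013, 6, 13, 12, 0, tzinfo=timezone.utc),
    identifier="customer-42",  # hypothetical identifier
    attributes={
        "Product Price": [99.99],
        "Product Category": ["Shoes", "Footwear"],
    },
)
```

Storing every attribute as a list keeps single-valued and multi-valued attributes uniform, which matches the slide's "value(s)" wording.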
  5. Our Storage
     • Only raw data is stored (not pre-aggregated)
     • Event table (row-oriented):
       – Stores data clustered by user profile
       – Used for low latency retrieval of individual profiles for offer management, and for bulk queries for analytics
     • Index table (“column-oriented”):
       – Stores data clustered by attribute type
       – Used for bulk queries (scanning) for analytics
     • Identity Graph:
       – Stores a graph of cross-channel identifiers for a user profile
       – Stored as an in-memory column family in the Events table
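The difference between the two clusterings can be illustrated with sorted in-memory lists standing in for HBase tables. The key layouts below are assumptions made for the sketch (the slides do not give the real row-key design); the point is only that sorting by profile makes one profile contiguous, while sorting by attribute type makes one attribute contiguous.

```python
from bisect import bisect_left

# Hypothetical raw events: (profile_id, timestamp, attribute_type, value)
events = [
    ("alice", 1, "Product Category", "Shoes"),
    ("alice", 2, "Product Price", 99.99),
    ("bob", 1, "Product Price", 20.0),
    ("bob", 3, "Product Category", "Hats"),
]

# Event table: sorted by (profile, timestamp), so each profile is one contiguous run.
event_table = sorted((p, ts, a, v) for p, ts, a, v in events)

# Index table: sorted by (attribute type, profile, timestamp),
# so each attribute type is one contiguous run.
index_table = sorted((a, p, ts, v) for p, ts, a, v in events)


def prefix_scan(table, prefix):
    """Return the contiguous rows whose first key field equals `prefix`."""
    start = bisect_left(table, (prefix,))
    rows = []
    for row in table[start:]:
        if row[0] != prefix:
            break
        rows.append(row)
    return rows


profile_rows = prefix_scan(event_table, "alice")          # low-latency profile read
price_rows = prefix_scan(index_table, "Product Price")    # bulk analytical scan
```

Both lookups are a seek followed by a sequential read, which is the access pattern a sorted store such as HBase serves well.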
  6. Maintaining Locality
     • Data locality (with HBase client) gives around a 60% throughput increase
       – Single node can scan around 1.6 million rows / second with RegionServer on separate machine
       – Same node can scan around 2.5 million rows / second with RegionServer on the local machine
     • Custom region splitter: ensures that (where possible), event tables and index tables are split at the same point
       – Tables divided into buckets, and split at bucket boundaries
     • Custom load balancer: ensures that index table data is balanced to the same RS as event table data
     • All upstream services are locality-aware
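One way to read "split at bucket boundaries" is sketched below: both tables prefix their row keys with a bucket derived from the profile, and both use the same split points, so rows for one profile fall into matching regions of each table. The bucket count, the MD5-based hash, and the key layouts are all assumptions for illustration; the slides do not describe the actual splitter or balancer.

```python
import hashlib
from bisect import bisect_right

NUM_BUCKETS = 16         # hypothetical bucket count
BUCKETS_PER_REGION = 4   # hypothetical: a new region starts every 4 buckets


def bucket_of(profile_id: str) -> int:
    """Stable bucket for a profile (MD5 is an assumed choice, not Causata's)."""
    digest = hashlib.md5(profile_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def split_points():
    """Bucket numbers at which a new region starts; shared by both tables."""
    return list(range(BUCKETS_PER_REGION, NUM_BUCKETS, BUCKETS_PER_REGION))


def region_of(bucket: int) -> int:
    """Index of the region that holds the given bucket."""
    return bisect_right(split_points(), bucket)


def event_row_key(profile_id, timestamp):
    return (bucket_of(profile_id), profile_id, timestamp)


def index_row_key(profile_id, attribute_type, timestamp):
    return (bucket_of(profile_id), attribute_type, profile_id, timestamp)
```

Because the region index depends only on the bucket, a balancer that assigns region `n` of both tables to the same server keeps a profile's event rows and index rows on one machine.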
  7. Querying Causata
     For each customer who has spent more than $100, get product views in the last week from now:
         SELECT S.product_views_in_last_week
         FROM Scenarios S
         WHERE S.timestamp = now()
         AND total_spend > 100;
     For each customer who has spent more than $100, get product views in the last week from when they purchased something:
         SELECT S.product_views_in_last_week
         FROM Scenarios S, Product_Purchase P
         WHERE S.timestamp = P.timestamp
         AND S.profile_id = P.profile_id
         AND S.total_spend > 100;
  8. Query Engine
     • Raw data stored in HBase, queries typically performed against aggregated data
       – Need to scan billions of rows, and aggregate on the fly
       – Many parallel scans performed:
         - Across machines (obviously)
         - Across regions (and therefore disks)
         - Across cores
     • Queries can optionally skip non-compacted data (based on HFile timestamps)
       – Allows result recency to be traded for performance
     • Some other performance tuning:
       – Short-circuit reads turned on (available from 0.94)
       – Multiple columns combined into one
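The "many parallel scans, aggregate on the fly" idea can be sketched as one scan task per region, each producing a partial aggregate that is merged at the end. In-memory lists stand in for HBase regions and the data is made up; this shows the shape of the computation, not Causata's query engine. Python threads are used only to show the structure and would not give real CPU parallelism for this workload.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions: each holds raw (profile_id, amount_spent) rows.
regions = [
    [("alice", 60.0), ("bob", 20.0)],
    [("alice", 50.0), ("carol", 120.0)],
    [("bob", 30.0)],
]


def scan_region(rows):
    """Scan one region and aggregate spend per profile while scanning."""
    partial = Counter()
    for profile_id, amount in rows:
        partial[profile_id] += amount
    return partial


def total_spend(all_regions, workers=4):
    """Run one scan per region in parallel and merge the partial aggregates."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_region, all_regions):
            totals.update(partial)  # Counter.update adds values per key
    return totals


totals = total_spend(regions)
big_spenders = sorted(p for p, spend in totals.items() if spend > 100)
```

Only the small per-region aggregates cross the merge step, so the raw rows never need to be collected in one place.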
  9. Parallelism
     Test setup: single Region Server, local client, all rows returned to client, disk-bound workload (disk cache cleared before test), ~1 billion rows scanned in total, ~15 bytes per row (on disk, compressed), 2 x 6 core Intel(R) X5650 @ 2.67GHz, 4 x 10k RPM SAS disks, 48GB RAM
  10. Request Prioritization
      • All requests to HBase go through a single thread pool
      • This allows requests to be prioritized according to sensitivity to latency
      • “Real-time” (latency-sensitive) requests are treated specially
      • Real-time request latency is monitored continuously, and more resources allocated if deadlines are not met
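A minimal sketch of routing all requests through one pool with priorities is shown below, using a priority queue drained by worker threads. The class name, the two priority levels, and the request labels are assumptions; the latency monitoring and dynamic resource allocation mentioned on the slide are not modelled here.

```python
import itertools
import queue
import threading

REALTIME, ANALYTICS = 0, 1  # lower number is served first (hypothetical levels)


class PrioritizedPool:
    """A single pool whose workers always take the most urgent queued request."""

    def __init__(self, workers=1):
        self._queue = queue.PriorityQueue()
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]

    def submit(self, priority, fn, *args):
        self._queue.put((priority, next(self._counter), fn, args))

    def start(self):
        for thread in self._threads:
            thread.start()

    def wait(self):
        """Block until every submitted request has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            _priority, _seq, fn, args = self._queue.get()
            try:
                fn(*args)
            finally:
                self._queue.task_done()


# Queue three requests before starting, so the service order is deterministic.
served = []
pool = PrioritizedPool(workers=1)
pool.submit(ANALYTICS, served.append, "bulk scan A")
pool.submit(REALTIME, served.append, "profile lookup")
pool.submit(ANALYTICS, served.append, "bulk scan B")
pool.start()
pool.wait()
```

The real-time lookup is served first even though it was submitted second, while the two analytical scans keep their submission order.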
  11. Questions…
      Email: neilf at causata dot com
      Web: http://www.causata.com
      Twitter: @causata