Cassandra Day SV 2014: Building a Personalization Platform with Cassandra at eBay



We describe the architecture of a personalization platform that captures customer profiles and behavioral data. A Cassandra cluster serves as an intermediate storage backend that replicates updates to profile records and timeline events across multiple data centers. A caching tier serves the user data and provides a real-time execution environment in which predictive models can, for example, calculate propensities or update category histograms. We delve into the metrics used to track replication performance and data freshness. We also discuss applications and features, such as user badges, that are powered by this new P13N platform.



  1. Bullseye P13n Platform
     April 7, 2014
     Charles Bracher, Bullseye Dev Manager
     Ranjan Sinha, PhD, Lead Research Scientist
  2. Outline
     • P13n Platform
     • Why Cassandra?
     • Cassandra Setup
     • Cassandra Usage
     • Cassandra Issues and Resolutions
     • Hand over to Ranjan for the Data Science Perspective
  3. Bullseye Functional Architecture
     Diagram labels:
     • Offline Analysis
     • Offline Database / Batch Processing
     • Recent User Data, 1-5 days (Cassandra)
     • Real Time Model Evaluation & Caching (sharded/full user state in memory)
     • Client Access
     • Near Real Time Event Collection
     • Tracking
     • Long Term User Data (Local SSD)
  4. Why Cassandra?
     • Great write performance
     • Great replication performance
     • Reasonable read performance
     • Reasonable cost
     • Client controlled consistency settings
  5. Cassandra Setup
     • Cassandra Version 1.2.9
     • We use Replication
       – Cassandra rings deployed to 3 datacenters
     • Cassandra clients
       – We use both the Datastax Java and C++ Beta clients
     • Using CQL Table specifications and commands
     • Not on SSDs
  6. Cassandra Usage
     • Column Family Design:
       – Avoid Tombstones
       – Avoid Compaction
     • With Focus on Short Term Storage:
       – Turn off automatic compaction / only manual compaction
       – Use unique column key names to avoid tombstones
       – Clear out old data with truncation
  7. Cache Miss Flow (New Session)
     CREATE TABLE DAY_N (USER_ID TEXT, RECORD_NAME TEXT, RECORD_VALUE BLOB, PRIMARY KEY (USER_ID, RECORD_NAME));
     • Write to the active day column family, keyed by user id.
     • Truncate the oldest day column family.
     • When going from one day to the next, do a manual compaction for the old day.
     • On read, pull user id info from all column families newer than the local SSD data.
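The day-table rotation on slide 7 can be sketched in a few lines of Python. This is a minimal illustration, not the Bullseye implementation: the ring size of 5, the `DAY_0` … `DAY_4` naming, and the helper names `day_table`, `rollover_statements`, and `tables_to_read` are all assumptions made for the example. The slides only state that a `DAY_N` table exists per day and that recent data covers 1-5 days.

```python
from typing import List

RING_SIZE = 5  # assumption: one table per day, up to the 5 days mentioned on slide 3


def day_table(day_number: int, ring_size: int = RING_SIZE) -> str:
    """Map an absolute day number (e.g. days since epoch) to a table in the ring."""
    return f"DAY_{day_number % ring_size}"


def rollover_statements(new_day: int, ring_size: int = RING_SIZE) -> List[str]:
    """CQL to issue when the active day changes.

    The slot that becomes active last held the oldest day, so it is truncated
    before new writes arrive. A manual compaction of the previous day's table
    would be run separately (for example with nodetool); it is not CQL.
    """
    return [f"TRUNCATE {day_table(new_day, ring_size)};"]


def tables_to_read(today: int, ssd_day: int, ring_size: int = RING_SIZE) -> List[str]:
    """Tables holding days newer than the last day already on local SSD."""
    oldest_available = today - ring_size + 1
    first = max(ssd_day + 1, oldest_available)
    return [day_table(d, ring_size) for d in range(first, today + 1)]


if __name__ == "__main__":
    today = 16170  # hypothetical day number
    print(day_table(today))
    print(rollover_statements(today))
    print(tables_to_read(today, ssd_day=16167))
```

Reusing a fixed ring of tables keeps deletion a whole-table `TRUNCATE`, which matches the slide's goal of avoiding tombstones and routine compaction.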
  8. Queuing Flow (Ongoing Activity)
     CREATE TABLE HOUR_N (ID TEXT, RECORD_NAME TEXT, RECORD_VALUE BLOB, PRIMARY KEY (ID, RECORD_NAME));
     • Read/Write from active hour with key timestamp rounded to nearest second
     • Store the column family one hour old to offline DB
     • Truncate the column family two hours old
     • Do async probe of record for current second as well as recent seconds till state is captured.
     • Data may be read 1-3 times. More if replication is lagging.
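The hourly queuing scheme on slide 8 can be sketched the same way. The code below is a hedged illustration only: the ring of three `HOUR_N` tables (active, one hour old, two hours old), the use of Unix timestamps, and the helper names `second_key`, `hour_roles`, and `probe_keys` are assumptions; the slide does not say how many hour tables exist or how keys are encoded.

```python
from typing import Dict, List

HOUR_RING = 3  # assumption: active hour, hour to archive, hour to truncate


def second_key(ts: float) -> str:
    """Row key: the (non-negative) Unix timestamp rounded to the nearest second."""
    return str(int(ts + 0.5))


def hour_roles(ts: float, ring_size: int = HOUR_RING) -> Dict[str, str]:
    """Which HOUR_N table plays which role at time ts."""
    hour = int(second_key(ts)) // 3600
    return {
        "active": f"HOUR_{hour % ring_size}",           # read/write target
        "archive": f"HOUR_{(hour - 1) % ring_size}",    # copy to the offline DB
        "truncate": f"HOUR_{(hour - 2) % ring_size}",   # clear with TRUNCATE
    }


def probe_keys(now: float, lookback_seconds: int) -> List[str]:
    """Keys to probe: the current second first, then recent seconds.

    A reader keeps probing until the state is captured, so a key may be
    read more than once when replication is lagging.
    """
    current = int(second_key(now))
    return [str(current - i) for i in range(lookback_seconds + 1)]


if __name__ == "__main__":
    now = 7200.4  # hypothetical timestamp
    print(second_key(now))
    print(hour_roles(now))
    print(probe_keys(now, lookback_seconds=2))
```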
  9. Cassandra Issues and Resolutions
     • Issues with C++ Datastax Cassandra beta client
       – open sourced, so could apply fixes
     • Performance issues with the cache miss query
       – increased heap size
       – reduced replication factor
       – turned off cross colo read repair
       – deployed data center aware policy for C++
  10. Personalization Applications
      Ranjan Sinha, PhD, Lead Research Scientist
      April 7, 2014
      Disclaimer: Some of the content in this talk is based on my personal opinion. It does not reflect the views of eBay.
  11. Outline
      • Why Personalize?
      • P13N Platform
        – Introduction
        – Conceptual architecture
        – Modeling stages
      • P13N Applications
        – User badges
        – Search ranking
        – Contextual models
        – Deals
  12. Why Personalize?
      • Enable more relevant experience
      • Retention of existing users
      • New user acquisition
      • Reactivating churned users
      • Increasing activity per user
      • Improving conversion from visits to transactions
  13. P13N Platform: Introduction
      • Maintains activity timeline information
      • Enables event processing at near real-time
      • Enables in-session personalization
      • Provides environment for predictive model evaluation
      • Backup and restore to and from Hadoop/HBase
  14. P13N Platform: Conceptual Architecture
      Diagram labels:
      • Tracking
      • Event Source
      • CEP Processor
      • Filters and forwards events
      • In-memory Cache + Model Evaluation
      • Activity Timeline + User Badges
      • Model Executor (m1, m2, m3, …, mn)
      • Cassandra
      • Hadoop/HBase
      • Offline Modeling Platform
      • User Badges
      • Client Access
  15. P13N Platform: Modeling stages
      • Realtime
        – In-session user intent
        – Contextual Models
      • Nearline
        – Update propensity models (aka User Badges)
      • Offline
        – Bootstrap propensity models by mining long-term behavior history
  16. Application (1): User Badges
      Profile based on long-term behavior

      Name            Description
      SaleType        Auction vs. Buy-it-now
      ItemCondition   New vs. Used
      Category        Preference of categories
      Price           Price range of purchasing activity
      Deals           Propensity to purchase deals
      Social Share    Propensity to share items in social media
  17. Application (2): Search Ranking
      • Should all queries be personalized in the same manner?
        – For some queries (ebay or google), everyone would like the same results
        – For other queries, different people may want completely different results

      Query: “big ben puzzles”
      User: always buys used items

      Not_P13N Rank   P13N Rank   Sold   IsNew   Title
      1               1           No     No      LOT OF 7 BIG BEN PUZZLES 5/1000PC. 2/1500 PUZZLES EUC
      2               3           No     Yes     1000 Pc MB Big Ben Jigsaw Puzzle Mount Shuksan North Cascades National Park WA
      3               2           Yes    No      COMPLETE Fishing Village,Smalls Island MB Big Ben Puzzle 1000 Piece Puzzle Size!
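The effect shown on slide 17 can be reproduced with a toy re-ranker. This is only a sketch under stated assumptions, not eBay's ranking model: it assumes a single binary ItemCondition preference, and the names `Item` and `personalize` are hypothetical. Items that match the user's preferred condition are moved ahead of those that do not, and the non-personalized rank breaks ties.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    title: str
    base_rank: int  # non-personalized rank (Not_P13N Rank on the slide)
    is_new: bool


def personalize(items: List[Item], prefers_used: bool) -> List[Item]:
    """Order items so those matching the user's condition preference come first.

    The sort key is (mismatch, base_rank): False sorts before True, so
    matching items lead, and the original ranking is kept within each group.
    """
    return sorted(items, key=lambda it: (it.is_new == prefers_used, it.base_rank))


if __name__ == "__main__":
    results = [
        Item("LOT OF 7 BIG BEN PUZZLES", base_rank=1, is_new=False),
        Item("1000 Pc MB Big Ben Jigsaw Puzzle", base_rank=2, is_new=True),
        Item("COMPLETE Fishing Village MB Big Ben Puzzle", base_rank=3, is_new=False),
    ]
    # The user on the slide always buys used items.
    for p13n_rank, item in enumerate(personalize(results, prefers_used=True), start=1):
        print(p13n_rank, item.base_rank, item.title)
```

For a user who prefers used items, the new item at base rank 2 drops to position 3 and the used item at base rank 3 moves up to position 2, which is the ordering in the slide's table.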
  18. Application (3): Contextual models
      • Infer categories that user is interested in within the current session
      • Long and Short term behavior
        – Historic behavior may provide benefits at the start of the session
        – Short-term behavior may contribute gains in an extended search session
        – Combination of session and historic behavior may outperform using either alone

      Diagram labels: event source emitting events e1, e2, e3, … along a time axis t; Offline, historical; Online, in-session; Nearline, after session expiry
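One simple way to combine session and historic behavior, as slide 18 suggests, is a weighted blend of two category distributions. The sketch below is an assumption for illustration, not the model from the talk: the weighting rule `n / (n + k)`, the smoothing constant `k`, and the function names `normalize` and `blend` are all hypothetical. The weight on session data is zero at the start of a session and grows as in-session events accumulate.

```python
from typing import Dict


def normalize(counts: Dict[str, float]) -> Dict[str, float]:
    """Turn raw category counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


def blend(session_counts: Dict[str, float],
          historic_counts: Dict[str, float],
          k: float = 2.0) -> Dict[str, float]:
    """Mix in-session and long-term category preferences.

    The session weight is n / (n + k), where n is the number of session
    events, so historic behavior dominates early in the session and
    in-session behavior dominates in an extended session.
    """
    n = sum(session_counts.values())
    w = n / (n + k)
    session = normalize(session_counts)
    historic = normalize(historic_counts)
    categories = set(session) | set(historic)
    return {c: w * session.get(c, 0.0) + (1 - w) * historic.get(c, 0.0)
            for c in categories}


if __name__ == "__main__":
    historic = {"toys": 3, "books": 1}   # hypothetical long-term counts
    print(blend({}, historic))            # start of session: historic only
    print(blend({"books": 2}, historic))  # after two in-session events
```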
  19. Application (4): Deals
      • Personalize categories
      • Personalize modules
      • Personalize tabs
      • Personalize items
  20. fin