Retail Reference Architecture Part 3: Scalable Insight Component Providing User History, Recommendations and Personalization
 


During this session we will cover the best practices for implementing the insight component with MongoDB. This includes efficiently ingesting and managing a large volume of user activity logs, such as clickstreams, views, likes and sales. We'll dive into how you can derive user statistics, product maps and trends using different analytics tools like the aggregation framework, map/reduce or the Hadoop connector. We will also cover operational considerations, including low-latency data ingestion and seamless aggregation queries.

Usage Rights

© All Rights Reserved


Presentation Transcript

  • Retail Reference Architecture with MongoDB Antoine Girbal Principal Solutions Engineer, MongoDB Inc. @antoinegirbal
  • Introduction
  • MongoDB Overview
  • 4 MongoDB Strategic Advantages
    • Horizontally Scalable - Sharding
    • Agile, Flexible
    • High Performance & Strong Consistency
    • Highly Available - Replica Sets
    Application document: { customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }
  • 5 Documents let you build your data to fit your application
    Relational:
      CustomerID | First Name | Last Name | City
      0          | John       | Doe       | New York
      1          | Mark       | Smith     | San Francisco
      2          | Jay        | Black     | Newark
      3          | Meagan     | White     | London
      4          | Edward     | Danields  | Boston

      Order Number | Store ID | Product      | Customer ID
      10           | 100      | Tablet       | 0
      11           | 101      | Smartphone   | 0
      12           | 101      | Dishwasher   | 0
      13           | 200      | Sofa         | 1
      14           | 200      | Coffee table | 1
      15           | 201      | Suit         | 2
    MongoDB:
      {
        customer_id: 1,
        name: "Mark Smith",
        city: "San Francisco",
        orders: [
          {
            order_number: 13,
            store_id: 10,
            date: "2014-01-03",
            products: [
              { SKU: 24578234, Qty: 3, Unit_price: 350 },
              { SKU: 98762345, Qty: 1, Unit_price: 110 }
            ]
          },
          { <...> }
        ]
      }
  • 6 Notions
      RDBMS    | MongoDB
      Database | Database
      Table    | Collection
      Row      | Document
      Column   | Field
  • Architecture Overview
  • 8 Information Management Merchandising Content Inventory Customer Channel Sales & Fulfillment Insight Social Architecture Overview Customer Channels Amazon Ebay … Stores POS Kiosk … Mobile Smartphone Tablet Website Contact Center API Data and Service Integration Social Facebook Twitter … Data Warehouse Analytics Supply Chain Management System Suppliers 3rd Party In Network Web Servers Application Servers
  • 9 Commerce Functional Components Information Layer Look & Feel Navigation Customization Personalization Branding Promotions Chat Ads Customer's Perspective Research Browse Search Select Shopping Cart Purchase Checkout Receive Track Use Feedback Maintain Dialog Assist Market / Offer Guide Offer Semantic Search Recommend Rule-based Decisions Pricing Coupons Sell / Fulfill Orders Payments Fraud Detection Fulfillment Business Rules Insight Session Capture Activity Monitoring Customer Enterprise Information Management Merchandising Content Inventory Customer Channel Sales & Fulfillment Insight Social
  • Merchandising
  • 11 Merchandising Merchandising MongoDB Product Variation Product Hierarchy Pricing Promotions Ratings & Reviews Calendar Semantic Search Product Definition Localization
  • 12 • Single view of a product: Single scalable catalog service used by all services and channels • Read volume is high and sustained • Write volume spikes up during catalog update, but also allows real-time updating of a product • Advanced indexing and querying is a requirement: find product by SKU, category, color, etc • Geographical distribution and low latency achieved through replication • Scaling achieved through sharding Merchandising - principles
  • 13 Merchandising - requirements
    Requirement | Example Challenge | MongoDB
    Single-view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage
    High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling
    Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling
    Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing
  • 14 Merchandising - Product Page Product images General Information List of Variations External Information Localized Description
  • 15 Merchandising - Product Definition
    > db.definitions.findOne()
    {
      productId: "301671",                 // main product id
      department: "Shoes",
      category: "Shoes/Women/Pumps",
      brand: "Guess",
      thumbnail: "http://cdn…/pump.jpg",
      image: "http://cdn…/pump1.jpg",      // larger version of thumbnail
      title: "Evening Platform Pumps",
      description: "Those evening platform pumps put the perfect finishing touches on your most glamourous night-on-the-town outfit",
      shortDescription: "Evening Platform Pumps",
      style: "Designer",
      type: "Platform",
      rating: 4.5,                         // user rating
      lastUpdated: Date("2014/04/01"),     // last update time
      …
    }
  • 16 Merchandising - Product Definition
    • Get item from Product Id
      db.definitions.findOne( { productId: "301671" } )
    • Get items from Product Ids
      db.definitions.find( { productId: { $in: [ "301671", "301672" ] } } )
    • Get items by department
      db.definitions.find( { department: "Shoes" } )
    • Get items by category prefix
      db.definitions.find( { category: /^Shoes\/Women/ } )
    • Indices: productId, department, category, lastUpdated
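The category-prefix query above depends on an anchored regular expression. As a minimal sketch (plain JavaScript, no database needed), the hypothetical helpers `escapeRegex` and `categoryPrefixFilter` below show one way an application might build that filter from a user-supplied category path. Both names are illustrative assumptions, not part of the presentation.

```javascript
// Hypothetical helper: escape regex metacharacters so the path is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a filter document with a prefix regex anchored at the start.
function categoryPrefixFilter(path) {
  return { category: new RegExp("^" + escapeRegex(path)) };
}

const filter = categoryPrefixFilter("Shoes/Women");
console.log(filter.category.test("Shoes/Women/Pumps")); // true
console.log(filter.category.test("Shoes/Men/Boots"));   // false
```

The resulting object could be passed to a find() call such as the one on the slide. Keeping the expression anchored with ^ is what makes it a prefix match rather than a substring match.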
  • 17 Merchandising - Product Variation
    > db.variations.findOne()
    {
      _id: "730223104376",                  // the sku
      productId: "301671",                  // references product id
      thumbnail: "http://cdn…/pump-red.jpg",
      image: "http://cdn…/pump-red.jpg",    // larger version of thumbnail
      size: 6.0,
      color: "Red",
      width: "B",
      heelHeight: 5.0,
      lastUpdated: Date("2014/04/01"),      // last update time
      …
    }
  • 18 Merchandising - Product Variation
    • Get Variation from SKU
      db.variations.find( { _id: "730223104376" } )
    • Get all variations for a product, sorted by SKU
      db.variations.find( { productId: "301671" } ).sort( { _id: 1 } )
    • Indices: productId, lastUpdated
  • 20 Merchandising – Pricing
    Price:
    {
      _id: "sku730223104376_store123",
      currency: "USD",
      price: 89.95,
      lastUpdated: Date("2014/04/01"),   // last update time
      …
    }
    • _id: concatenation of item and store
    • Store: can be a store group or store id
    • Item: can be an item id or sku
    • Indices: lastUpdated
  • 21 Merchandising - Pricing
    • Get all prices for a given item
      db.prices.find( { _id: /^p301671_/ } )
    • Get all prices for a given sku (price could be at item level)
      db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )
    • Get minimum and maximum prices for a sku (the match stage is assumed to reuse the filter of the previous query)
      db.prices.aggregate( [
        { $match: { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } },
        { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } }
      ] )
    • Get price for a sku and store id (returns up to 4 prices)
      db.prices.find(
        { _id: { $in: [ "sku730223104376_store1234", "sku730223104376_sgroup0", "p301671_store1234", "p301671_sgroup0" ] } },
        { price: 1 } )
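The last pricing query looks up at most four _id values, one per combination of item level (sku or product) and store level (store or store group). The sketch below, in plain JavaScript, builds those candidate ids and picks one price from the returned documents. The function names and the "most specific first" precedence are assumptions made for illustration; the slides do not state which price wins when several exist.

```javascript
// Hypothetical helper: build the candidate price ids for a sku at a store,
// ordered from most specific (sku + store) to least specific (product + store group).
function priceIdCandidates(sku, productId, storeId, storeGroupId) {
  const items = ["sku" + sku, "p" + productId];
  const stores = ["store" + storeId, "sgroup" + storeGroupId];
  const ids = [];
  for (const item of items) {
    for (const store of stores) {
      ids.push(item + "_" + store);
    }
  }
  return ids;
}

// Hypothetical helper: given the documents returned by the $in query,
// return the first one found in candidate order (assumed precedence), or null.
function pickPrice(candidates, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  for (const id of candidates) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const candidates = priceIdCandidates("730223104376", "301671", "1234", "0");
const found = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];
console.log(candidates);
console.log(pickPrice(candidates, found)); // the sku-level store-group price
```

Because the ids are exact strings, the lookup can be served entirely by the default _id index.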
  • 22 Merchandising – Product Hierarchy
    The hierarchy of items typically follows:
    • Company
      – Division:
        • Department: Women's shoe store
          – Class: Pumps
            » Item: Guess classic pump
              • Variation: size 6 black
  • 24 Merchandising – Browse and Search products Browse by category Special Lists Filter by attributes Lists hundreds of item summaries Ideally a single query is issued to the database to obtain all items and metadata to display
  • 25 The previous page presents many challenges: • Response is needed within milliseconds for hundreds of items • Faceted search on many attributes of an item: department, brand, category, etc • Attributes to match may be at the variation level: color, size, etc, in which case the variation should be shown • One item may have thousands of variations. Only one item should be displayed even if many variations match • Efficient sorting on several attributes: price, popularity • Pagination feature which requires deterministic ordering Merchandising – Browse and Search products
  • 26 Merchandising – Browse and Search products Hundreds of sizes One Item Dozens of colors A single item may have thousands of variations
  • 27 Merchandising – Browse and Search products Images of the matching variations are displayed Hierarchy Sort parameter Faceted Search
  • 28 Merchandising – Traditional Architecture Relational DB System of Records Full Text Search Engine Indexing #1 obtain search results IDs Application Cache #2 obtain objects by ID Pre-joined into objects
  • 29 The traditional architecture presents issues: • 3 different systems to maintain: RDBMS, Search engine, Caching layer • A search returns a list of IDs which then are looked up in the cache as a batch or one by one. It significantly increases latency of response • RDBMS schema is complex and static • The search index needs to be refreshed at intervals • Setup does not allow efficient pagination Merchandising – Traditional Architecture
  • 30 MongoDB Data Store Merchandising - Architecture Product Summaries Product Definitions Pricing Promotions Product Variations Ratings & Reviews #1 Obtain results
  • 31 The product index relies on the following parameters: • The department (required): the main component of category, e.g. "Shoes" • An indexed attribute (optional) – Category path, e.g. "Shoes/Women/Pumps" – Price range (based on online prices) – List of Item Attributes, e.g. Brand = Guess – List of Variation Attributes, e.g. Color = red • A non-indexed attribute (optional) – List of Item Secondary Attributes, e.g. Style = Designer – List of Variation Secondary Attributes, e.g. heel height = 5.0 • As well as Sorting, e.g. Price Low to High Merchandising – Product Summaries
  • 32 Merchandising – Product Summaries
    > db.summaries.findOne()
    {
      "_id": "p39",
      "title": "Evening Platform Pumps 39",
      "department": "Shoes",
      "category": "Shoes/Women/Pumps",
      "thumbnail": "http://cdn…/pump-small-39.jpg",
      "image": "http://cdn…/pump-39.jpg",
      "price": 145.99,
      "rating": 0.95,
      "attrs": [ { "brand" : "Guess" }, … ],
      "sattrs": [ { "style" : "Designer" }, { "type" : "Platform" }, … ],
      "vars": [
        {
          "sku": "sku2441",
          "thumbnail": "http://cdn…/pump-small-39.jpg.Blue",
          "image": "http://cdn…/pump-39.jpg.Blue",
          "attrs": [ { "size": 6.0 }, { "color": "Blue" }, … ],
          "sattrs": [ { "width" : "B" }, { "heelHeight" : 5.0 }, … ]
        },
        … Many more skus …
      ]
    }
    Indices: vars.sku, department + attrs + category, department + vars.attrs + category, department + category, department + price, department + rating
  • 33 Merchandising - Product Summaries
    • Get summary from item id
      db.summaries.find( { _id: "p301671" } )
    • Get summary's specific variation from SKU
      db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
    • Get summary by department, sorted by rating
      db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )
    • Get summary with mix of parameters
      db.summaries.find( {
        "department" : "Shoes",
        "vars.attrs" : { "color" : "Gray" },
        "category" : /^Shoes\/Women/,
        "price" : { "$gte" : 65.99, "$lte" : 180.99 }
      } )
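  • 34 Merchandising – Query stats
    Query parameters                                | Time
    Department | Category | Price | Primary attribute | Average (ms) | 90th (ms) | 95th (ms)
    1          | 0        | 0     | 0                 | 2            | 3         | 3
    1          | 1        | 0     | 0                 | 1            | 2         | 2
    1          | 0        | 1     | 0                 | 1            | 2         | 3
    1          | 1        | 1     | 0                 | 1            | 2         | 2
    1          | 0        | 0     | 1                 | 0            | 1         | 2
    1          | 1        | 0     | 1                 | 0            | 1         | 1
    1          | 0        | 1     | 1                 | 1            | 2         | 2
    1          | 1        | 1     | 1                 | 0            | 1         | 1
    1          | 0        | 0     | 2                 | 1            | 3         | 3
    1          | 1        | 0     | 2                 | 0            | 2         | 2
    1          | 0        | 1     | 2                 | 10           | 20        | 35
    1          | 1        | 1     | 2                 | 0            | 1         | 1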
  • Content
  • 36 Content Content MongoDB Metadata Asset Repository Digital Right Mgt Access Control Processing / Encoding
  • Inventory
  • 38 Inventory Inventory MongoDB External Inventory Internal Inventory Regional Inventory Purchase Orders Fulfillment Promotions
  • 39 Demonstration Document Model Definitions • id: p0 Variations • id: sku0 • pId: p0 Summary • id: p0 • vars: [sku0, sku1, …] Stores • id: s1 • Loc: [22, 33] Inventory • store: s1 • pId: p0 • vars: [{sku: sku0, q: 3}, {sku: sku2, q: 2}] Product
  • 40 Inventory - Stores
    > db.stores.findOne()
    {
      "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
      "className" : "catalog.Store",
      "storeId" : "store0",
      "name" : "Bessemer store",
      "address" : {
        "addr1" : "1st Main St",
        "city" : "Bessemer",
        "state" : "AL",
        "zip" : "12345",
        "country" : "US"
      },
      "location" : [ -86.95444, 33.40178 ],
      …
    }
  • 41 Inventory - Stores
    • Get a store by storeId
      db.stores.find( { storeId: "store0" } )
    • Get nearby stores sorted by distance
      db.runCommand( { "geoNear" : "stores", "near" : [ -82.800672, 40.090844 ], "maxDistance" : 10.0, "spherical" : true } )
  • 42 Inventory - Quantities
    > db.inventory.findOne()
    {
      "_id": "5354869f300487d20b2b011d",
      "storeId": "store0",
      "location": [ -86.95444, 33.40178 ],
      "productId": "p0",
      "vars": [
        { "sku": "sku1", "q": 14 },
        { "sku": "sku3", "q": 7 },
        { "sku": "sku7", "q": 32 },
        { "sku": "sku14", "q": 65 },
        ...
      ]
    }
  • 43 Inventory - Stores
    • Get all items in a store
      db.inventory.find( { storeId: "store100" } )
    • Get quantity for an item at a store
      db.inventory.find( { storeId: "store100", productId: "p200" } )
    • Get quantity for a sku at a store
      db.inventory.find( { storeId: "store100", productId: "p200", "vars.sku": "sku11736" }, { "vars.$": 1 } )
    • Increment / decrement inventory for an item at a store
      db.inventory.update( { storeId: "store100", productId: "p200", "vars.sku": "sku11736" }, { $inc: { "vars.$.q": 20 } } )
    • Indices: productId, storeId + productId, location (geo) + productId
  • 44 Inventory - Stores
    • Aggregate total quantity for an item
      db.inventory.aggregate( [
        { $match: { productId: "p200" } },
        { $unwind: "$vars" },
        { $group: { _id: "result", count: { $sum: 1 } } }
      ] )
      { "_id" : "result", "count" : 101752 }
    • Aggregate total quantity for a store
      db.inventory.aggregate( [
        { $match: { storeId: "store100" } },
        { $unwind: "$vars" },
        { $group: { _id: "result", count: { $sum: 1 } } }
      ] )
      { "_id" : "result", "count" : 29347 }
    Note: as written, { $sum: 1 } counts the unwound variation entries; to total the stocked units, sum the quantity field instead with { $sum: "$vars.q" }.
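To make the difference between counting unwound variation entries and summing their quantities concrete, here is a small plain-JavaScript sketch that mimics the $match / $unwind / $group stages over an in-memory array. The sample documents and the function name `summarize` are made up for illustration and follow the inventory model shown earlier.

```javascript
// Made-up sample documents following the inventory model (store, product, per-sku quantities).
const inventoryDocs = [
  { storeId: "store100", productId: "p200", vars: [{ sku: "sku1", q: 14 }, { sku: "sku3", q: 7 }] },
  { storeId: "store101", productId: "p200", vars: [{ sku: "sku1", q: 32 }] },
  { storeId: "store100", productId: "p300", vars: [{ sku: "sku9", q: 5 }] },
];

// Hypothetical helper mimicking the pipeline for one product.
function summarize(docs, productId) {
  let entries = 0;
  let quantity = 0;
  for (const doc of docs) {
    if (doc.productId !== productId) continue; // $match
    for (const variation of doc.vars) {        // $unwind: "$vars"
      entries += 1;                            // what { $sum: 1 } accumulates
      quantity += variation.q;                 // what { $sum: "$vars.q" } accumulates
    }
  }
  return { entries, quantity };
}

console.log(summarize(inventoryDocs, "p200")); // { entries: 3, quantity: 53 }
```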
  • 45 Inventory - Stores
    • Get inventory for an item near a point
      db.runCommand( {
        "geoNear" : "inventory", "near" : [ -82.800672, 40.090844 ], "maxDistance" : 10.0, "spherical" : true, limit: 10,
        query: { productId: "p200", "vars.sku": "sku11736" }
      } )
    • Get closest store with available sku
      db.runCommand( {
        "geoNear" : "inventory", "near" : [ -82.800672, 40.090844 ], "maxDistance" : 10.0, "spherical" : true, limit: 10,
        query: { productId: "p200", vars: { $elemMatch: { "sku": "sku11736", q: { $gt: 0 } } } }
      } )
  • Customer
  • 47 Customer Customer MongoDB Profile Market Segment Demographics Wish List Preference Inbox Sales / Support Chat Content Subscription
  • Channels
  • 49 Channels Channels MongoDB Location Store Assortment Point of Sale Channel Definition Planogram
  • Sales & Fulfillment
  • 51 Sales & Fulfillment Sales & Fulfillment MongoDB Sales Transaction Shipping Tracking Return & Exchange Business Rule Audit Shopping Cart
  • Insight
  • 53 Insight Insight MongoDB Advertising metrics Clickstream Recommendations Session Capture Activity Logging Geo Tracking Product Analytics Customer Insight Application Logs
  • 54 • Many user activities can be of interest: – Search – Product view, like or wish – Shopping cart add / remove – Sharing on social network – Ad impression, Clickstream • Those will be used to compute: – Product Map (relationships, etc) – User Preferences – Recommendations – Trends Activity Logging – Data of interest
  • 55 Activity logging - Architecture MongoDB HVDF API Activity Logging User History External Analytics: Hadoop, Spark, Storm, … User Preferences Recommendations Trends Product Map Apps Internal Analytics: Aggregation, MR All user activity is recorded MongoDB – Hadoop Connector Personalization
  • 56 Activity Logging
  • 57 • You need to store and manage an incoming stream of data samples (views, impressions, orders, …) – High arrival rate of data from many sources – Variable schema of arriving data – You need to control retention period of data • You need to compute derivative data sets based on these samples – Aggregations and statistics based on data – Roll-up data into pre-computed reports and summaries • You need low latency access to up-to-date data (user history) – Flexible indexing of raw and derived data sets – Rich querying based on time + meta-data fields in samples Activity Logging – Problem statement
  • 58 Activity logging - Requirements
    Requirement | MongoDB
    Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.
    Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.
    Fast querying on varied fields, sorting | Secondary Btree indexes can look up and sort the data in milliseconds.
    Easy clean up of old data | Deletes are typically as expensive as inserts. Getting free deletes via time partitioning.
  • 59 Activity Logging using HVDF HVDF (High Volume Data Feed): • Open source reference implementation of high volume writing with MongoDB • REST API server written in Java with most popular libraries • Public project, issues can be logged • Can be run as-is, or customized as needed
  • 60 Feed High volume data feed architecture Channel Sample Sample Sample Sample Source Source Processor Inline Processing Batch Processing Stream Processing The Channel is the sequence of data samples that a sensor sends into the platform. Sources send samples into the Channel Processors generate derivative Channels from other Channel data
  • 61 HVDF -- High Volume Data Feed engine HVDF – Reference implementation REST Service API Processor Plugins Inline Batch Stream Channel Data Storage Raw Channel Data Aggregated Rollup T1 Aggregated Rollup T2 Query Processor Streaming spout Custom Stream Processing Logic Incoming Sample Stream POST /feed/channel/data GET /feed/channel/data?time=XXX&range=YYY Real-time Queries
  • 62 User Activity - Model
    {
      _id: ObjectId(),
      geoCode: 1,                                     // used to localize write operations
      sessionId: "2373BB…",
      device: { id: "1234", type: "mobile/iphone", userAgent: "Chrome/34.0.1847.131" },
      type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…",      // type of activity
      itemId: "301671",
      sku: "730223104376",
      order: { id: "12520185", … },
      location: [ -86.95444, 33.40178 ],
      tags: [ "smartphone", "iphone", … ],            // associated tags
      timeStamp: Date("2014/04/01 …")
    }
  • 63 Dynamic schema for sample data
    Each sample in a Channel can have variable fields:
    Sample 1: { deviceId: XXXX, time: Date(…), type: "VIEW", … }
    Sample 2: { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }
    Sample 3: { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }
  • 64 Channels are sharded Shard Shard Shard Shard Shard Shard Key: Customer_id Sample { customer_id: XXXX, time: Date(…) type: "VIEW", } Channel You choose how to partition samples Samples can have dynamic schema Scale horizontally by adding shards Each shard is highly available
  • 65 Channels are time partitioned
    Channel: Partition 1 (- 2 days), Partition 2 (- 1 Day), … Partition N (Today), each holding Samples
    • Each partition is a separate collection
    • Partitioning keeps indexes manageable
    • Today's partition is where all of the writes happen
    • Older partitions are read only for best possible concurrency
    • Queries are routed only to needed partitions
    • Efficient and space reclaiming purging of old data
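The time-partitioning idea can be sketched in a few lines of plain JavaScript: derive a collection name from a sample's timestamp, list the partitions a time-range query needs, and find the partitions old enough to drop. The daily granularity, the `channel_YYYYMMDD` naming scheme and all function names are assumptions for illustration only; they are not taken from HVDF.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical naming scheme: one collection per channel per UTC day.
function partitionName(channel, date) {
  const day = new Date(date).toISOString().slice(0, 10).replace(/-/g, "");
  return channel + "_" + day;
}

// Partitions a query for the range [from, to] has to visit.
function partitionsForRange(channel, from, to) {
  const names = [];
  // Round the start down to UTC midnight, then step one day at a time.
  let t = Math.floor(new Date(from).getTime() / DAY_MS) * DAY_MS;
  const end = new Date(to).getTime();
  for (; t <= end; t += DAY_MS) {
    names.push(partitionName(channel, t));
  }
  return names;
}

// Partitions of one channel that fall before the retention cutoff.
// Dropping such a collection removes its data and indexes in one operation.
function expiredPartitions(existing, channel, now, retentionDays) {
  const cutoff = partitionName(channel, new Date(now).getTime() - retentionDays * DAY_MS);
  return existing.filter((name) => name < cutoff);
}

console.log(partitionName("activity", "2014-04-01T10:00:00Z")); // activity_20140401
console.log(partitionsForRange("activity", "2014-03-30T22:00:00Z", "2014-04-01T01:00:00Z"));
```

The fixed-width date suffix is what lets a simple string comparison order the partitions of a channel chronologically.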
  • 66 Dynamic queries on Channels Channel Sample Sample Sample Sample App App App Indexes Queries Pipelines Map-Reduce Create custom indexes on Channels Use full mongodb query language to access samples Use mongodb aggregation pipelines to access samples Use mongodb inline map-reduce to access samples Full access to field, text, and geo indexing
  • 67 North America - West North America - East Europe Geographically distributed system Channel Sample Sample Sample Sample Source Source Source Source Source Source Sample Sample Sample Sample Geo shards per location Clients write local nodes Single view of channel available globally
  • 68 Insight
  • 69 Insight – Useful Data • Useful data for better shopping: – User history (e.g. recently seen products) – User statistics (e.g. total purchases, visits) – User interests (e.g. likes videogames and SciFi) – User social network – Cross-selling: people who bought this item had tendency to buy those other items (e.g. iPhone, then bought iPhone case) – Up-selling: people who looked at this item eventually bought those items (alternative product that may be better)
  • 70 Example of real-time aggregation with Agg Framework User Activity – Computing User Stats
  • 71 Example of real-time aggregation with Agg Framework User Activity – Computing User Stats
  • 72 User Activity – Items frequently bought together
    Let's simplify each recorded activity as the following:
      { userId: 123, type: "order", itemId: 2, time }
      { userId: 123, type: "order", itemId: 3, time }
      { userId: 234, type: "order", itemId: 7, time }
    To calculate the items bought by each user over a period of time, let's use MongoDB's Map Reduce:
    – Match activities of type "order" for the past 2 weeks
    – map: emit the document by userId
    – reduce: push all itemId values into a list
    – Output looks like { _id: userId, items: [2, 3, 8] }
  • 73 User Activity – Items frequently bought together
    Then run a 2nd map-reduce job that, for each of the previous results:
    – map: emits every combination of 2 items, starting with the lowest itemId
    – reduce: sums up the total
    – Output looks like { _id: { a: 2, b: 3 }, count: 36 }
  • 74 User Activity – Items frequently bought together
    The output collection can then be queried per item id, sorted by count, and cut off at a threshold.
    This requires indexes on { "_id.a", count } and { "_id.b", count }.
    You then obtain an affiliation collection with docs like:
      { itemId: 2, affil: [ { id: 3, weight: 36 }, { id: 8, weight: 23 } ] }
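The two map-reduce passes and the final affiliation step can be illustrated without a database. The plain-JavaScript sketch below runs the same logic over an in-memory array: group ordered items by user, count every pair of items with the lowest itemId first, then build per-item affiliation lists above a threshold. The sample data, the function names and the "a:b" key encoding are assumptions for illustration; the presentation itself uses MongoDB map-reduce jobs.

```javascript
// Made-up activity samples in the simplified shape from the slides.
const activities = [
  { userId: 123, type: "order", itemId: 2 },
  { userId: 123, type: "order", itemId: 3 },
  { userId: 123, type: "view", itemId: 9 },
  { userId: 234, type: "order", itemId: 7 },
  { userId: 345, type: "order", itemId: 3 },
  { userId: 345, type: "order", itemId: 2 },
  { userId: 345, type: "order", itemId: 8 },
];

// Pass 1: items ordered by each user (match on type, emit by userId, collect itemIds).
function itemsByUser(acts) {
  const result = new Map();
  for (const a of acts) {
    if (a.type !== "order") continue;
    if (!result.has(a.userId)) result.set(a.userId, new Set());
    result.get(a.userId).add(a.itemId);
  }
  return result;
}

// Pass 2: count every pair of items, keyed "a:b" with a < b.
function pairCounts(byUser) {
  const counts = new Map();
  for (const items of byUser.values()) {
    const sorted = [...items].sort((x, y) => x - y);
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = sorted[i] + ":" + sorted[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

// Final step: per-item affiliation lists, sorted by weight, cut off at a threshold.
function affiliations(counts, threshold) {
  const byItem = new Map();
  for (const [key, count] of counts) {
    if (count < threshold) continue;
    const [a, b] = key.split(":").map(Number);
    for (const [itemId, other] of [[a, b], [b, a]]) {
      if (!byItem.has(itemId)) byItem.set(itemId, []);
      byItem.get(itemId).push({ id: other, weight: count });
    }
  }
  const docs = [];
  for (const [itemId, affil] of byItem) {
    affil.sort((x, y) => y.weight - x.weight);
    docs.push({ itemId, affil });
  }
  return docs;
}

const counts = pairCounts(itemsByUser(activities));
console.log(affiliations(counts, 1));
```

Emitting each pair with the lowest itemId first means (2, 3) and (3, 2) land on the same key, which is why the slides need lookups on both `_id.a` and `_id.b` to find all partners of one item.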
  • 75 Example of Hadoop integration User Activity – Hadoop integration
  • Social
  • 77 Social Social MongoDB Social Channels User Network Activity Chat Social Profiles Community Mgt Rewards / Gamification
  • Conclusion
  • Appendix
  • 83 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Single View of Product Cluster Topology
  • 84 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Primary node replicates data to all secondaries in the shard as fast as possible Single View of Product Cluster Topology
  • 85 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Center Shard contains all the data for stores in Center region Single View of Product Cluster Topology
  • 86 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Center Shard contains all the data for stores in Center region Local writes enable very high throughput of updates Single View of Product Cluster Topology
  • 87 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Each region is able to see the data of all stores from its “local” DC. Single View of Product Cluster Topology
  • 88 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Two nodes in each DC for painless maintenance with zero downtime Single View of Product Cluster Topology
  • 89 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Even if a DC goes out, the database remains fully available thanks to automated failover Single View of Product Cluster Topology
  • 90 West DC Primary Primary Primary Shard “West” Shard “Center” Shard “East” Center DC East DC Data set can grow, shards can add up, without any rewrite of the application code Single View of Product Cluster Topology
  • Thank You! Antoine Girbal Senior Solutions Engineer, MongoDB Inc. @antoinegirbal