Retail Reference Architecture
 

Usage Rights

© All Rights Reserved

  • How does eventual consistency fit into the idea of inventory? Is something in stock or out of stock?
    Items on hand matters at order time. What about at buying time?
    Are we pitching this as the system of record for inventory or as a single view on top of multiple, discrete inventory systems?
  • In a single view sort of application, where we’re designing for many use-cases instead of a single application, how do we handle schema design trade-offs?
  • Challenges for every service/component: schema, indexing, sharding.
    Most important criteria: user-facing latency, linear scaling of services.
  • Not shown on this slide:
    Audit collection
    Assortments – list of items in an order that a shop is going to make (backorder?)
    Shipments – going to stores
    One SKU per item
    Fast reading / writing to support updating the inventory in real time
    Make this like slide 30, drop the fields, just show the collection relations
  • Fix stream box. Add validator box.

Retail Reference Architecture Presentation Transcript

  • 1. Retail Reference Architecture with MongoDB Antoine Girbal Principal Solutions Engineer, MongoDB Inc. @antoinegirbal
  • 2. Introduction
  • 3. Retail solution
    • It is way too broad to tackle with one solution
    • Data maps so well to the document model
    • Needs for agility, performance and scaling
    • Many (e)retailers are already using MongoDB
    • Let's define the best ways and places for it!
  • 4. MongoDB is a great fit
    • Holds complex JSON structures
    • Dynamic schema for agility
    • Complex querying and in-place updating
    • Secondary, compound and geo indexing
    • Full consistency, durability, atomic operations
    • Near linear scaling via sharding
    • Overall, MongoDB is a unique fit!
  • 5. MongoDB Strategic Advantages
    • Horizontally scalable: sharding
    • Agile, flexible
    • High performance and strong consistency
    • Highly available: replica sets
    Application: { customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }
  • 6. Build your data to fit your application
    Relational:
    CustomerID | First Name | Last Name | City
    0 | John | Doe | New York
    1 | Mark | Smith | San Francisco
    2 | Jay | Black | Newark
    3 | Meagan | White | London
    4 | Edward | Danields | Boston
    Order Number | Store ID | Product | Customer ID
    10 | 100 | Tablet | 0
    11 | 101 | Smartphone | 0
    12 | 101 | Dishwasher | 0
    13 | 200 | Sofa | 1
    14 | 200 | Coffee table | 1
    15 | 201 | Suit | 2
    MongoDB:
    { customer_id: 1, name: "Mark Smith", city: "San Francisco", orders: [ { order_number: 13, store_id: 10, date: "2014-01-03", products: [ { SKU: 24578234, Qty: 3, Unit_price: 350 }, { SKU: 98762345, Qty: 1, Unit_price: 110 } ] }, { <...> } ] }
  • 7. Notions
    RDBMS | MongoDB
    Database | Database
    Table | Collection
    Row | Document
    Column | Field
  • 8. Retail Components Overview
  • 9. Architecture Overview: Information Management Merchandising Content Inventory Customer Channel Sales & Fulfillment Insight Social Customer Channels Amazon Ebay … Stores POS Kiosk … Mobile Smartphone Tablet Website Contact Center API Data and Service Integration Social Facebook Twitter … Data Warehouse Analytics Supply Chain Management System Suppliers 3rd Party In Network Web Servers Application Servers
  • 10. Commerce Functional Components: Information Layer Look & Feel Navigation Customization Personalization Branding Promotions Chat Ads Customer's Perspective Research Browse Search Select Shopping Cart Purchase Checkout Receive Track Use Feedback Maintain Dialog Assist Market / Offer Guide Offer Semantic Search Recommend Rule-based Decisions Pricing Coupons Sell / Fulfill Orders Payments Fraud Detection Fulfillment Business Rules Insight Session Capture Activity Monitoring Customer Enterprise Information Management Merchandising Content Inventory Customer Channel Sales & Fulfillment Insight Social
  • 11. Merchandising
  • 12. Merchandising (MongoDB): Variant, Hierarchy, Pricing, Promotions, Ratings & Reviews, Calendar, Semantic Search, Item, Localization
  • 13. Merchandising - principles
    • Single view of a product, one central catalog service
    • Read volume high and sustained, 100k reads / s
    • Write volume spikes up during catalog update
    • Advanced indexing and querying
    • Geographical distribution and low latency
    • No need for a cache layer, CDN for assets
  • 14. Merchandising - requirements
    Requirement | Example Challenge | MongoDB
    Single view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage
    High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling
    Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling
    Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing
  • 15. Merchandising - Product Page: Product images, General Information, List of Variants, External Information, Localized Description
  • 16. Merchandising - Item Model
    > db.item.findOne()
    {
      _id: "301671", // main item id
      department: "Shoes",
      category: "Shoes/Women/Pumps",
      brand: "Guess",
      thumbnail: "http://cdn…/pump.jpg",
      image: "http://cdn…/pump1.jpg", // larger version of thumbnail
      title: "Evening Platform Pumps",
      description: "Those evening platform pumps put the perfect finishing touches on your most glamourous night-on-the-town outfit",
      shortDescription: "Evening Platform Pumps",
      style: "Designer",
      type: "Platform",
      rating: 4.5, // user rating
      lastUpdated: new Date("2014-04-01"), // last update time
      …
    }
  • 17. Merchandising - Item Definition
    • Get item by id
      db.item.findOne( { _id: "301671" } )
    • Get items from product ids
      db.item.find( { _id: { $in: [ "301671", "301672" ] } } )
    • Get items by department
      db.item.find( { department: "Shoes" } )
    • Get items by category prefix
      db.item.find( { category: /^Shoes\/Women/ } )
    • Indices: _id (productId), department, category, lastUpdated
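The category-prefix lookup depends on an anchored regular expression, and a category path contains "/" and possibly other special characters. A minimal sketch of building such a filter safely is below; the helper names escapeRegExp and categoryPrefixFilter are hypothetical and not part of the deck.

```javascript
// Hypothetical helpers (not from the deck): build a category-prefix filter.

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns a filter document whose regex is anchored at the start of the path,
// which lets the database use the index on "category" for prefix matches.
function categoryPrefixFilter(prefix) {
  return { category: new RegExp("^" + escapeRegExp(prefix)) };
}

const filter = categoryPrefixFilter("Shoes/Women");
console.log(filter.category.test("Shoes/Women/Pumps"));       // true
console.log(filter.category.test("Accessories/Shoes/Women")); // false
```

The resulting document can be passed as the query argument of a find call on the item collection.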
  • 18. Merchandising – Variant Model
    > db.variant.findOne()
    {
      _id: "730223104376", // the sku
      itemId: "301671", // references item id
      thumbnail: "http://cdn…/pump-red.jpg", // variant specific
      image: "http://cdn…/pump-red.jpg",
      size: 6.0,
      color: "Red",
      width: "B",
      heelHeight: 5.0,
      lastUpdated: new Date("2014-04-01"), // last update time
      …
    }
  • 19. Merchandising – Variant Model
    • Get variant from SKU
      db.variant.find( { _id: "730223104376" } )
    • Get all variants for a product, sorted by SKU
      db.variant.find( { itemId: "301671" } ).sort( { _id: 1 } )
    • Indices: itemId, lastUpdated
  • 20. Merchandising – per store Pricing
    Per-store pricing could result in billions of documents, unless you build it in a modular way.
    Price:
    {
      _id: "sku730223104376_store123",
      currency: "USD",
      price: 89.95,
      lastUpdated: new Date("2014-04-01"), // last update time
      …
    }
    • _id: concatenation of item and store
    • Item: can be an item id or SKU
    • Store: can be a store group or store id
    • Indices: lastUpdated
  • 21. Merchandising – per store Pricing
    • Get all prices for a given item
      db.prices.find( { _id: /^p301671_/ } )
    • Get all prices for a given sku (price could be at item level)
      db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )
    • Get minimum and maximum prices for a sku
      db.prices.aggregate( [ { $match: { … } }, { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )
    • Get price for a sku and store id (returns up to 4 prices)
      db.prices.find( { _id: { $in: [ "sku730223104376_store1234", "sku730223104376_sgroup0", "p301671_store1234", "p301671_sgroup0" ] } }, { price: 1 } )
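The last lookup can return up to four prices, so the application has to pick one. The sketch below builds the four candidate _id values and resolves them in a fixed order. The precedence used here (SKU before item, store before store group) is an assumption for illustration, as are the function names.

```javascript
// Build the candidate price _ids for one lookup, most specific first.
// The precedence order is an assumption, not something the deck specifies.
function priceIdCandidates(sku, itemId, storeId, storeGroupId) {
  return [
    "sku" + sku + "_store" + storeId,
    "sku" + sku + "_sgroup" + storeGroupId,
    "p" + itemId + "_store" + storeId,
    "p" + itemId + "_sgroup" + storeGroupId,
  ];
}

// Pick the first candidate that exists among the documents returned
// by the $in query; returns null when no price document matched.
function resolvePrice(candidates, docs) {
  const byId = new Map(docs.map((doc) => [doc._id, doc]));
  for (const id of candidates) {
    if (byId.has(id)) return byId.get(id).price;
  }
  return null;
}

const candidates = priceIdCandidates("730223104376", "301671", "1234", "0");
const sampleDocs = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];
console.log(resolvePrice(candidates, sampleDocs));
```

The candidates array is what would go inside the $in clause of the query on the prices collection.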
  • 22. Merchandising – Browse and Search products
    • Browse by category
    • Special lists
    • Filter by attributes
    • Lists hundreds of item summaries
    Ideally a single query is issued to the database to obtain all items and metadata to display.
  • 23. Merchandising – Browse and Search products
    The previous page presents many challenges:
    • Response within milliseconds for hundreds of items
    • Faceted search on many attributes: category, brand, …
    • Attributes at the variant level: color, size, etc., and the variant's image should be shown
    • Thousands of variants for an item, need to de-duplicate
    • Efficient sorting on several attributes: price, popularity
    • Pagination feature which requires deterministic ordering
  • 24. Merchandising – Browse and Search products: one item, dozens of colors, hundreds of sizes. A single item may have thousands of variants.
  • 25. Merchandising – Browse and Search products: hierarchy, sort parameter, faceted search. Images of the matching variants are displayed.
  • 26. Merchandising – Traditional Architecture: Relational DB (System of Records), Full Text Search Engine (Indexing), Application, Cache (pre-joined into objects). #1 obtain search result IDs, #2 obtain objects by ID.
  • 27. Merchandising – Traditional Architecture
    The traditional architecture issues:
    • 3 different systems to maintain: RDBMS, search engine, caching layer
    • Search returns a list of IDs to be looked up in the cache, which increases response latency
    • RDBMS schema is complex and static
    • The search index is expensive to update
    • Setup does not allow efficient pagination
  • 28. Merchandising - Architecture: MongoDB Data Store with Summaries, Items, Pricing, Promotions, Variants, Ratings & Reviews. #1 Obtain results.
  • 29. Merchandising – Summary Model
    The summary relies on the following parameters:
    • Department, e.g. "Shoes"
    • An indexed attribute
      – Category path, e.g. "Shoes/Women/Pumps"
      – Price range
      – List of Item Attributes, e.g. Brand = Guess
      – List of Variant Attributes, e.g. Color = red
    • A non-indexed attribute
      – List of Item Secondary Attributes, e.g. Style = Designer
      – List of Variant Secondary Attributes, e.g. heel height = 4.0
    • Sorting, e.g. Price Low to High
  • 30. Merchandising – Summary Model
    > db.summaries.findOne()
    {
      "_id": "p39",
      "title": "Evening Platform Pumps 39",
      "department": "Shoes",
      "category": "Shoes/Women/Pumps",
      "thumbnail": "http://cdn…/pump-small-39.jpg",
      "image": "http://cdn…/pump-39.jpg",
      "price": 145.99,
      "rating": 0.95,
      "attrs": [ { "brand" : "Guess"}, … ],
      "sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, … ],
      "vars": [
        {
          "sku": "sku2441",
          "thumbnail": "http://cdn…/pump-small-39.jpg.Blue",
          "image": "http://cdn…/pump-39.jpg.Blue",
          "attrs": [ { "size": 6.0 }, { "color": "Blue" }, … ],
          "sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, … ]
        },
        … Many more skus …
      ]
    }
  • 31. Merchandising - Summary Model
    • Get summary from item id
      db.summaries.find( { _id: "p301671" } )
    • Get summary's specific variation from SKU
      db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
    • Get summaries by department, sorted by rating
      db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )
    • Get summaries with a mix of parameters
      db.summaries.find( { "department": "Shoes", "vars.attrs": { "color": "Gray" }, "category": /^Shoes\/Women/, "price": { "$gte": 65.99, "$lte": 180.99 } } )
  • 32. Merchandising – Summary Model
    • The following indices are used:
      – department + attrs + category + _id
      – department + vars.attrs + category + _id
      – department + category + _id
      – department + price + _id
      – department + rating + _id
    • _id is used for pagination
    • Can take advantage of index intersection
    • With several attributes specified (e.g. color=red and size=6), which one is looked up?
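One way to read "_id is used for pagination" is range-based (keyset) paging: sort on the chosen field with _id as a tie-breaker, then ask for rows strictly after the last row of the previous page. The sketch below builds such a filter. It is an interpretation under that assumption, and nextPageFilter and sortSpec are hypothetical names.

```javascript
// Sort specification with _id as a tie-breaker, so ordering is deterministic.
function sortSpec(sortField) {
  return { [sortField]: 1, _id: 1 };
}

// Build the filter for the page that follows `last`, where `last` holds the
// sort value and _id of the final row already shown (null for the first page).
function nextPageFilter(baseFilter, sortField, last) {
  if (!last) return { ...baseFilter };
  return {
    ...baseFilter,
    $or: [
      { [sortField]: { $gt: last.value } },
      { [sortField]: last.value, _id: { $gt: last.id } },
    ],
  };
}

const firstPage = nextPageFilter({ department: "Shoes" }, "price", null);
const secondPage = nextPageFilter(
  { department: "Shoes" },
  "price",
  { value: 145.99, id: "p39" }
);
console.log(JSON.stringify(secondPage));
```

Each filter would be used with the matching sortSpec and a limit equal to the page size, which lines up with an index such as department + price + _id.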
  • 33. Merchandising – Facet
    Facet samples:
    { "_id" : "Accessory Type=Hosiery" , "count" : 14}
    { "_id" : "Ladder Material=Steel" , "count" : 2}
    { "_id" : "Gold Karat=14k" , "count" : 10138}
    { "_id" : "Stone Color=Clear" , "count" : 1648}
    { "_id" : "Metal=White gold" , "count" : 10852}
    Single operation to insert / update (upsert):
    db.facet.update( { _id: "Accessory Type=Hosiery" }, { $inc: { count: 1 } }, true, false )
    The facet with the lowest count is the most restrictive. It should come first in the query!
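The "lowest count first" rule can be sketched as a small sort over the facets a shopper selected, using counts read from the facet collection. The function name and the choice to place facets with unknown counts last are assumptions.

```javascript
// Order the selected facets so that the most restrictive one (lowest count)
// comes first. Facets missing from the counts map are placed last.
function orderByRestrictiveness(selectedFacets, facetCounts) {
  const countOf = (facet) =>
    facetCounts.has(facet) ? facetCounts.get(facet) : Number.MAX_SAFE_INTEGER;
  return [...selectedFacets].sort((a, b) => countOf(a) - countOf(b));
}

// Counts taken from the facet samples on the slide.
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);

const ordered = orderByRestrictiveness(
  ["Metal=White gold", "Stone Color=Clear", "Gold Karat=14k"],
  facetCounts
);
console.log(ordered);
```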
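  • 34. Merchandising – Query stats
    Department | Category | Price | Primary attribute | Time average (ms) | 90th (ms) | 95th (ms)
    1 | 0 | 0 | 0 | 2 | 3 | 3
    1 | 1 | 0 | 0 | 1 | 2 | 2
    1 | 0 | 1 | 0 | 1 | 2 | 3
    1 | 1 | 1 | 0 | 1 | 2 | 2
    1 | 0 | 0 | 1 | 0 | 1 | 2
    1 | 1 | 0 | 1 | 0 | 1 | 1
    1 | 0 | 1 | 1 | 1 | 2 | 2
    1 | 1 | 1 | 1 | 0 | 1 | 1
    1 | 0 | 0 | 2 | 1 | 3 | 3
    1 | 1 | 0 | 2 | 0 | 2 | 2
    1 | 0 | 1 | 2 | 10 | 20 | 35
    1 | 1 | 1 | 2 | 0 | 1 | 1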
  • 35. Inventory
  • 36. Inventory – Traditional Architecture: Relational DB (System of Records), Nightly Batches, Analytics / Aggregations / Reports, Caching Layer, Field Inventory, Internal & External Apps, Point-in-time Loads
  • 37. Opportunities Missed
    • Can't reliably detect availability
    • Can't redirect purchasers to in-store pickup
    • Can't do intra-day replenishment
    • Degraded customer experience
    • Higher internal expense
  • 38. Inventory – Principles
    • Single view of the inventory
    • Used by most services and channels
    • Read-dominated workload
    • Local, real-time writes
    • Bulk writes for refresh
    • Geographically distributed
    • Horizontally scalable
  • 39. Inventory – Requirements
    Requirement | Challenge | MongoDB
    Single view of inventory | Ensure availability of inventory information on all channels and services | Developer-friendly, document-oriented storage
    High volume, low latency reads | Anytime, anywhere access to inventory data without overloading the system of record | Fast, indexed reads; local reads; horizontal scaling
    Bulk updates, intra-day deltas | Provide window-in-time consistency for highly available services | Bulk writes; fast, in-place updates; horizontal scaling
    Rapid application development cycles | Deliver new services rapidly to capture new opportunities | Flexible schema; rich query language; agile-friendly iterations
  • 40. Inventory – Target Architecture: Relational DB (System of Records); collections Inventory, Assortments, Shipments, Audits, Products, Stores; Analytics, Aggregations, Reports; Field Inventory; Internal & External Apps; Point-in-time Loads, Nightly Refresh, Real-time Updates
  • 41. Inventory – Technical Decisions: Store, Inventory, Schema, Indexing, Horizontal Scaling
  • 42. Inventory – Collections: Stores, Inventory, Products, Audits, Assortments, Shipments
  • 43. Stores – Sample Document
    > db.stores.findOne()
    {
      "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
      "className" : "catalog.Store",
      "storeId" : "store0",
      "name" : "Bessemer store",
      "address" : {
        "addr1" : "1st Main St",
        "city" : "Bessemer",
        "state" : "AL",
        "zip" : "12345",
  • 44. Stores – Sample Queries
    • Get a store by storeId
      db.stores.find({ "storeId" : "store0" })
    • Get a store by zip code
      db.stores.find({ "address.zip" : "12345" })
  • 45. What’s near me?
  • 46. Stores – Sample Geo Queries
    • Get nearby stores sorted by distance
      db.runCommand({ geoNear : "stores", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true })
  • 47. Stores – Sample Geo Queries
    • Get the five nearest stores within 10 km
      db.stores.find({ location : { $near : { $geometry : { type : "Point", coordinates : [-82.80, 40.09] }, $maxDistance : 10000 } } }).limit(5)
  • 48. Stores – Indices
    • { "storeId" : 1 }
    • { "name" : 1 }
    • { "address.zip" : 1 }
    • { "location" : "2dsphere" }
  • 49. Inventory – Sample Document
    > db.inventory.findOne()
    {
      "_id": "5354869f300487d20b2b011d",
      "storeId": "store0",
      "location": [-86.95444, 33.40178],
      "productId": "p0",
      "vars": [
        { "sku": "sku1", "q": 14 },
        { "sku": "sku3", "q": 7 },
        { "sku": "sku7", "q": 32 },
        { "sku": "sku14", "q": 65 },
        ...
  • 50. Inventory – Sample Queries
    • Get all items in a store
      db.inventory.find({ storeId : "store100" })
    • Get quantity for an item at a store
      db.inventory.find({ "storeId" : "store100", "productId" : "p200" })
  • 51. Inventory – Sample Queries
    • Get quantity for a sku at a store
      db.inventory.find( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "vars.$" : 1 } )
  • 52. Inventory – Sample Update
    • Increment / decrement inventory for an item at a store
      db.inventory.update( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "$inc" : { "vars.$.q" : 20 } } )
  • 53. Inventory – Sample Aggregations
    • Aggregate total quantity for a product
      db.inventory.aggregate( [ { $match : { productId : "p200" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )
      { "_id" : "result", "count" : 101752 }
  • 54. Inventory – Sample Aggregations
    • Count the SKUs in stock (quantity > 0) at a store
      db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $match : { "vars.q" : { $gt : 0 } } }, { $group : { _id : "result", count : { $sum : 1 } } } ] )
      { "_id" : "result", "count" : 29347 }
  • 55. Inventory – Sample Aggregations
    • Aggregate total quantity for a store
      db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )
      { "_id" : "result", "count" : 29347 }
  • 56. 63
  • 57. Inventory – Sample Geo-Query
    • Get inventory for an item near a point
      db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true, limit : 10, query : { "productId" : "p200", "vars.sku" : "sku11736" } } )
  • 58. Inventory – Sample Geo-Query
    • Get closest store with available sku
      db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.800672, 40.090844] }, maxDistance : 10000.0, spherical : true, limit : 1, query : { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } } } )
  • 59. Inventory – Sample Geo-Aggregation
    • Get count of inventory for an item near a point
      db.inventory.aggregate( [
        { $geoNear: { near : { type : "Point", coordinates : [-82.800672, 40.090844] }, distanceField: "distance", maxDistance: 10000.0, spherical : true, query: { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } }, includeLocs: "dist.location", num: 5 } },
        { $unwind: "$vars" },
        { $match: { "vars.sku" : "sku11736" } },
        { $group: { _id: "result", count: { $sum: "$vars.q" } } }
      ] )
  • 60. Inventory – Sample Indices
    • { storeId : 1 }
    • { productId : 1, storeId : 1 }
    • Why not "vars.sku"?
      – { productId : 1, storeId : 1, "vars.sku" : 1 }
    • { productId : 1, location : "2dsphere" }
  • 61. Inventory – Technical Decisions: Store, Inventory, Schema, Indexing, Horizontal Scaling
  • 62. Inventory – Sharding Topology: Shard West (primary in West DC), Shard Central (primary in Central DC), Shard East (primary in East DC), Legacy Inventory
  • 63. Inventory – Shard Key
    • Choose shard key
      – { productId : 1, storeId : 1 }
    • Set up sharding
      – sh.enableSharding("inventoryDB")
      – sh.shardCollection( "inventoryDB.inventory", { productId : 1, storeId : 1 } )
  • 64. Inventory – Shard Tags
    • Set up shard tags
      – sh.addShardTag("shard0000", "west")
      – sh.addShardTag("shard0001", "central")
      – sh.addShardTag("shard0002", "east")
    • Set up tag ranges
      – Add new field: region
      – sh.addTagRange("inventoryDB.inventory", { region : 0 }, { region : 100 }, "west" )
      – sh.addTagRange("inventoryDB.inventory", { region : 100 }, { region : 200 }, "central" )
      – sh.addTagRange("inventoryDB.inventory", { region : 200 }, { region : 300 }, "east" )
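The tag ranges above map a numeric region value to a shard tag, with the lower bound inclusive and the upper bound exclusive. A small client-side sketch of that mapping follows; TAG_RANGES and tagForRegion are hypothetical names, and the table simply mirrors the three ranges on the slide.

```javascript
// Mirror of the three tag ranges on the slide: min inclusive, max exclusive.
const TAG_RANGES = [
  { min: 0, max: 100, tag: "west" },
  { min: 100, max: 200, tag: "central" },
  { min: 200, max: 300, tag: "east" },
];

// Return the tag whose range contains the region value, or null if none does.
function tagForRegion(region) {
  const range = TAG_RANGES.find((r) => region >= r.min && region < r.max);
  return range ? range.tag : null;
}

console.log(tagForRegion(42));  // "west"
console.log(tagForRegion(100)); // "central"
```

An application could use such a lookup to choose the region value it writes into each inventory document so the document lands on the shard in the nearest data center.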
  • 65. Insight
  • 66. Insight (MongoDB): Advertising metrics, Clickstream, Recommendations, Session Capture, Activity Logging, Geo Tracking, Product Analytics, Customer Insight, Application Logs
  • 67. Activity Logging – Data of interest
    Many user activities can be of interest:
    • Search
    • Product view, like or wish
    • Shopping cart add / remove
    • Sharing on social network
    • Ad impression, clickstream
  • 68. Activity Logging – Data of interest
    Will be used to compute:
    • Product Map (relationships, etc.)
    • User Preferences
    • Recommendations
    • Trends
    • …
  • 69. Activity logging - Architecture: Apps; HVDF API; MongoDB (Activity Logging, User History, User Preferences, Recommendations, Trends, Product Map); Internal Analytics: Aggregation, MR; External Analytics: Hadoop, Spark, Storm, …; MongoDB – Hadoop Connector; Personalization. All user activity is recorded.
  • 70. 91 Activity Logging
  • 71. Activity Logging – Problem statement
    • Store and manage an incoming stream of data samples
      – High arrival rate of data from many sources
      – Variable schema of arriving data
      – Control retention period of data
    • Compute derivative data sets based on these samples
      – Aggregations and statistics based on data
      – Roll-up data into pre-computed reports and summaries
    • Low latency access to up-to-date data (user history)
      – Flexible indexing of raw and derived data sets
      – Rich querying based on time + meta-data fields in samples
  • 72. Activity logging - Requirements
    Requirement | MongoDB
    Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.
    Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.
    Fast querying on varied fields, sorting | Secondary B-tree indexes can look up and sort the data in milliseconds.
    Easy clean up of old data | Deletes are typically as expensive as inserts. Getting free deletes via time partitioning.
  • 73. Activity Logging using HVDF
    HVDF (High Volume Data Feed):
    • Open source reference implementation of high volume writing with MongoDB: https://github.com/10gen-labs/hvdf
    • REST API server written in Java with the most popular libraries
    • Public project, issues can be logged: https://jira.mongodb.org/browse/HVDF
    • Can be run as-is, or customized as needed
  • 74. High volume data feed architecture
    • Samples are grouped by Feed and Channel
    • Sources send samples
    • Processors generate derivative Channels (inline, batch and stream processing)
  • 75. HVDF – Reference implementation (High Volume Data Feed engine)
    • REST Service API: POST /feed/channel/data for the incoming sample stream, GET /feed/channel/data?time=XXX&range=YYY for real-time queries
    • Processor plugins: inline, batch, stream
    • Channel data storage: raw channel data, aggregated rollup T1, aggregated rollup T2
    • Query processor, streaming spout, custom stream processing logic
  • 76. User Activity - Model
    {
      _id: ObjectId(),
      geoCode: 1, // used to localize write operations
      sessionId: "2373BB…",
      device: { id: "1234", type: "mobile/iphone", userAgent: "Chrome/34.0.1847.131" },
      userId: "u123",
      type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity
      itemId: "301671",
      sku: "730223104376",
      order: { id: "12520185", … },
      location: [ -86.95444, 33.40178 ],
      tags: [ "smartphone", "iphone", … ], // associated tags
      timeStamp: new Date("2014/04/01 …")
    }
  • 77. Dynamic schema for sample data
    Each sample in a Channel can have variable fields:
    Sample 1: { deviceId: XXXX, time: Date(…), type: "VIEW", … }
    Sample 2: { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }
    Sample 3: { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }
  • 78. Channels are sharded
    Sample: { customer_id: XXXX, time: Date(…), type: "VIEW" }
    • Shard key: customer_id
    • You choose how to partition samples
    • Samples can have dynamic schema
    • Scale horizontally by adding shards
    • Each shard is highly available
  • 79. Channels are time partitioned
    • Each partition is a separate collection (Partition 1 … Partition N, e.g. today, -1 day, -2 days)
    • Partitioning keeps indexes manageable
    • All of the writes happen in the current partition
    • Older partitions are read only for best possible concurrency
    • Queries are routed only to needed partitions
    • Efficient and space-reclaiming purging of old data
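Time partitioning comes down to deriving a collection name from a sample's timestamp and routing a time-bounded query to the partitions it overlaps. The sketch below uses one partition per UTC day and a channel_YYYYMMDD naming scheme. Both are assumptions for illustration and not necessarily what HVDF does.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Name of the daily partition (collection) that holds a given timestamp.
// The "<channel>_<YYYYMMDD>" scheme is hypothetical.
function partitionName(channel, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, "");
  return channel + "_" + day;
}

// All daily partitions that a query over [from, to] needs to touch.
function partitionsForRange(channel, from, to) {
  const names = [];
  let t = Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate());
  for (; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

const touched = partitionsForRange(
  "activity",
  new Date("2014-03-30T23:00:00Z"),
  new Date("2014-04-01T01:00:00Z")
);
console.log(touched);
```

Purging old data then becomes dropping whole collections whose day falls outside the retention window, instead of deleting documents one by one.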
  • 80. Dynamic queries on Channels
    • Indexes: create custom indexes on Channels
    • Queries: use the full MongoDB query language to access samples
    • Pipelines: use MongoDB aggregation pipelines to access samples
    • Map-Reduce: use MongoDB inline map-reduce to access samples
    • Full access to field, text, and geo indexing
  • 81. Geographically distributed system
    • Regions: North America - West, North America - East, Europe
    • Geo shards per location
    • Clients write to local nodes
    • Single view of channel available globally
  • 82. 103 Insight
  • 83. Insight – Useful Data
    Useful data for better shopping:
    • User history (e.g. recently seen products)
    • User statistics (e.g. total purchases, visits)
    • User interests (e.g. likes videogames and SciFi)
    • User social network
  • 84. Insight – Useful Data
    Useful data for selling more:
    • Cross-selling: people who bought this item had a tendency to buy those other items (e.g. iPhone, then bought iPhone case)
    • Up-selling: people who looked at this item eventually bought those items (alternative product that may be better)
  • 85. Insight – User History
    • Get the recent activity for a user, to populate the "recently viewed" list
      db.activities.find({ userId: "u123", time: { $gt: DATE } }).sort({ time: -1 }).limit(100)
    • Get the recent activity for a product, to populate the "N users bought this in the past N hours" list
      db.activities.find({ itemId: "301671", time: { $gt: DATE } }).sort({ time: -1 }).limit(100)
    • Indices: time, userId + time, deviceId + time, itemId + time
    • All queries should be time bound, since this is a lot of data!
  • 86. Insight – User Stats
    • Get the recent number of views, purchases, etc. for a user
      db.activities.aggregate( [ { $match: { userId: "u123", time: { $gt: DATE } } }, { $group: { _id: "$type", count: { $sum: 1 } } } ] )
    • Get the total recent sales for a user
      db.activities.aggregate( [ { $match: { userId: "u123", time: { $gt: DATE }, type: "ORDER" } }, { $group: { _id: "result", count: { $sum: "$totalPrice" } } } ] )
    • Get the recent number of views, purchases, etc. for an item
      db.activities.aggregate( [ { $match: { itemId: "301671", time: { $gt: DATE } } }, { $group: { _id: "$type", count: { $sum: 1 } } } ] )
    • Those aggregations are very fast, real-time
  • 87. Insight – User Stats
    • Number of activities for unique visitors for the past hour. Calculation of uniques is hard for any system!
      db.activities.aggregate( [ { $match: { time: { $gt: NOW-1H } } }, { $group: { _id: "$userId", count: { $sum: 1 } } } ], { allowDiskUse: true } )
    • The aggregation above can have issues (single-shard final grouping, result not persisted). Map Reduce is a better alternative here:
      var map = function() { emit(this.userId, 1); }
      var reduce = function(key, values) { return Array.sum(values); }
      db.activities.mapReduce(map, reduce, { query: { time: { $gt: NOW-1H } }, out: { replace: "lastHourUniques", sharded: true } })
      db.lastHourUniques.find({ _id: "u123" }) // number of activities for a user
      db.lastHourUniques.count() // total uniques
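The map and reduce steps for the unique-visitor count can be followed on a small in-memory sample. The sketch below imitates them in plain JavaScript: one counter per userId, where the number of keys is the number of uniques. The sample data and the function name are made up for illustration.

```javascript
// Count activities per user for samples newer than `since`.
// Equivalent in spirit to: map emits (userId, 1), reduce sums the values.
function countActivitiesPerUser(activities, since) {
  const counts = new Map();
  for (const activity of activities) {
    if (activity.time > since) {
      counts.set(activity.userId, (counts.get(activity.userId) || 0) + 1);
    }
  }
  return counts;
}

const since = new Date("2014-04-01T10:00:00Z");
const sampleActivities = [
  { userId: "u123", type: "VIEW", time: new Date("2014-04-01T10:05:00Z") },
  { userId: "u123", type: "CART_ADD", time: new Date("2014-04-01T10:06:00Z") },
  { userId: "u234", type: "VIEW", time: new Date("2014-04-01T10:07:00Z") },
  { userId: "u345", type: "VIEW", time: new Date("2014-04-01T09:00:00Z") },
];

const perUser = countActivitiesPerUser(sampleActivities, since);
console.log(perUser.get("u123")); // activities for one user
console.log(perUser.size);        // number of unique visitors
```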
  • 88. User Activity – Items bought together: time to cross-sell!
  • 89. User Activity – Items bought together
    Let's simplify each recorded activity as follows:
    { userId: "u123", type: "order", itemId: 2, time: DATE }
    { userId: "u123", type: "order", itemId: 3, time: DATE }
    { userId: "u234", type: "order", itemId: 7, time: DATE }
    Calculate items bought by a user with Map Reduce:
    - Match activities of type "order" for the past 2 weeks
    - map: emit the document by userId
    - reduce: push all itemId in a list
    - Output looks like { _id: "u123", items: [2, 3, 8] }
  • 90. User Activity – Items bought together
    Then run a 2nd mapreduce job from the previous output to compute the number of occurrences of each item combination:
    - query: go over all documents (1 document per userId)
    - map: emit every combination of 2 items, starting with lowest itemId
    - reduce: sum up the total
    - Output looks like { _id: { a: 2, b: 3 }, count: 36 }
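The second job's map and reduce logic can be sketched in plain JavaScript on the first job's output. For each user, every pair of distinct items is emitted with the lower itemId first, and the emitted pairs are summed. The function name and the "a:b" key format are illustrative choices.

```javascript
// Count how many users bought each pair of items together.
// Input: documents shaped like the first job's output, { _id, items: [...] }.
// Output: Map keyed by "a:b" with a < b, valued by the number of users.
function pairCounts(userItems) {
  const counts = new Map();
  for (const { items } of userItems) {
    // De-duplicate and sort so each pair is emitted once, lowest itemId first.
    const unique = [...new Set(items)].sort((x, y) => x - y);
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = unique[i] + ":" + unique[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const pairs = pairCounts([
  { _id: "u123", items: [2, 3, 8] },
  { _id: "u234", items: [3, 2] },
]);
console.log(pairs.get("2:3")); // bought together by two users
```

Emitting the lower itemId first keeps (2, 3) and (3, 2) from being counted as different combinations.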
  • 91. User Activity – Items bought together
    Then obtain the most popular combinations per item:
    - Indices created on { "_id.a": 1, count: 1 } and { "_id.b": 1, count: 1 }
    - Query with a threshold:
      db.combinations.find( { "_id.a": 2, count: { $gt: 10 } } ).sort({ count: -1 })
      db.combinations.find( { "_id.b": 2, count: { $gt: 10 } } ).sort({ count: -1 })
    Later we can create a more compact recommendation collection that includes popular combinations with weights, like:
    { itemId: 2, recom: [ { itemId: 32, weight: 36 }, { itemId: 158, weight: 23 }, … ] }
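Turning combination documents into the compact per-item recommendation shape can be sketched as follows. Each combination above the threshold contributes to both of its items, and each item's list is ordered by weight. The function name and the sample counts are illustrative assumptions.

```javascript
// Build { itemId, recom: [ { itemId, weight }, ... ] } documents from
// combination documents shaped like { _id: { a, b }, count }.
function buildRecommendations(combinations, threshold) {
  const byItem = new Map();
  const add = (itemId, otherId, weight) => {
    if (!byItem.has(itemId)) byItem.set(itemId, []);
    byItem.get(itemId).push({ itemId: otherId, weight: weight });
  };
  for (const combo of combinations) {
    if (combo.count <= threshold) continue; // keep popular combinations only
    add(combo._id.a, combo._id.b, combo.count);
    add(combo._id.b, combo._id.a, combo.count);
  }
  return [...byItem].map(([itemId, recom]) => ({
    itemId: itemId,
    recom: recom.sort((x, y) => y.weight - x.weight),
  }));
}

const recommendations = buildRecommendations(
  [
    { _id: { a: 2, b: 158 }, count: 23 },
    { _id: { a: 2, b: 32 }, count: 36 },
    { _id: { a: 2, b: 7 }, count: 4 },
  ],
  10
);
console.log(JSON.stringify(recommendations.find((r) => r.itemId === 2)));
```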
  • 92. User Activity – Hadoop integration
    • Applications: CRM, ERP, Collaboration, Mobile, BI
    • Data Management (Operational, Analytical): RDBMS, EDW
    • Infrastructure: OS & Virtualization, Compute, Storage, Network
    • Management & Monitoring, Security & Auditing
  • 93. Commerce (MongoDB Connector for Hadoop): Applications powered by, Analysis powered by • Products & Inventory • Recommended products • Customer profile • Session management • Elastic pricing • Recommendation models • Predictive analytics • Clickstream history
  • 94. Connector Overview
    • Data: Read/Write MongoDB, Read/Write BSON
    • Tools: MapReduce, Pig, Hive, Spark
    • Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
  • 95. Connector Features and Functionality
    • Open-source on GitHub: https://github.com/mongodb/mongo-hadoop
    • Computes splits to read data
      – Single Node, Replica Sets, Sharded Clusters
    • Mappings for Pig and Hive
      – MongoDB as a standard data source/destination
    • Support for
      – Filtering data with MongoDB queries
      – Authentication
      – Reading from Replica Set tags
      – Appending to existing collections
  • 96. MapReduce Configuration
    • MongoDB input
      – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
      – mongo.input.uri = mongodb://mydb:27017/db1.collection1
    • MongoDB output
      – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
      – mongo.output.uri = mongodb://mydb:27017/db1.collection2
    • BSON input/output
      – mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
      – mapred.input.dir = hdfs:///tmp/database.bson
      – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
      – mapred.output.dir = hdfs:///tmp/output.bson
  • 97. Pig Mappings
    • Input: BSONLoader and MongoLoader
      data = LOAD 'mongodb://mydb:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader
    • Output: BSONStorage and MongoInsertStorage
      STORE records INTO 'hdfs:///output.bson' using com.mongodb.hadoop.pig.BSONStorage
  • 98. Hive Support
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
    TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
    • Access collections as Hive tables
    • Use with MongoStorageHandler or BSONStorageHandler
  • 99. Thank You! Antoine Girbal Principal Solutions Engineer, MongoDB Inc. @antoinegirbal