This document discusses using MongoDB for inventory management in retail applications. Some key points:
- MongoDB allows for a single view of inventory across all channels with real-time updates and bulk writes for refresh. Its flexible schema and horizontal scaling are well-suited for inventory needs.
- Collections would include Stores, Inventory, Products, Audits, Assortments, and Shipments. Stores documents contain store-specific metadata.
- Inventory documents have embedded documents for products and variants with attributes like size and color. This embedded structure allows for efficient queries on combinations of attributes.
- The target architecture replaces traditional batch-based ETL with real-time updates to MongoDB for improved customer experience and business operations.
3. 4
• it is way too broad to tackle with one solution
• data maps so well to the document model
• needs for agility, performance and scaling
• Many (e)retailers are already using MongoDB
• Let's define the best ways and places for it!
Retail solution
4. 5
• Holds complex JSON structures
• Dynamic Schema for Agility
• complex querying and in-place updating
• Secondary, compound and geo indexing
• full consistency, durability, atomic operations
• Near linear scaling via sharding
• Overall, MongoDB is a unique fit!
MongoDB is a great fit
6. 7
build your data to fit your application
Relational MongoDB
{ customer_id : 1,
name : "Mark Smith",
city : "San Francisco",
orders: [ {
order_number : 13,
store_id : 10,
date: "2014-01-03",
products: [
{SKU: 24578234,
Qty: 3,
Unit_price: 350},
{SKU: 98762345,
Qty: 1,
Unit_price: 110}
]
},
{ <...> }
]
}
CustomerID First Name Last Name City
0 John Doe New York
1 Mark Smith San Francisco
2 Jay Black Newark
3 Meagan White London
4 Edward Danields Boston
Order Number Store ID Product Customer ID
10 100 Tablet 0
11 101 Smartphone 0
12 101 Dishwasher 0
13 200 Sofa 1
14 200 Coffee table 1
15 201 Suit 2
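The contrast between the two models can be made concrete with a small sketch that folds relational-style rows into one embedded customer document. It is plain JavaScript with no database calls; the function name `embedOrders` and the camelCase row fields are assumptions made for illustration, and the sample rows are taken from the tables above.

```javascript
// Hypothetical rows mirroring the two relational tables above.
const customers = [
  { customerId: 1, firstName: "Mark", lastName: "Smith", city: "San Francisco" },
];
const orders = [
  { orderNumber: 13, storeId: 200, product: "Sofa", customerId: 1 },
  { orderNumber: 14, storeId: 200, product: "Coffee table", customerId: 1 },
  { orderNumber: 15, storeId: 201, product: "Suit", customerId: 2 },
];

// Fold a customer's orders into an embedded array, so that one read
// returns the customer together with its orders (no join at query time).
function embedOrders(customer, allOrders) {
  return {
    customer_id: customer.customerId,
    name: customer.firstName + " " + customer.lastName,
    city: customer.city,
    orders: allOrders
      .filter((o) => o.customerId === customer.customerId)
      .map((o) => ({
        order_number: o.orderNumber,
        store_id: o.storeId,
        product: o.product,
      })),
  };
}

const doc = embedOrders(customers[0], orders);
console.log(JSON.stringify(doc, null, 2));
```

Whether to embed or reference depends on how the application reads the data; embedding suits the case where orders are almost always fetched with their customer.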
13. 14
• Single view of a product, one central catalog service
• Read volume high and sustained, 100k reads / s
• Write volume spikes up during catalog update
• Advanced indexing and querying
• Geographical distribution and low latency
• No need for a cache layer, CDN for assets
Merchandising - principles
14. 15
Merchandising - requirements
Requirement | Example Challenge | MongoDB
Single-view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage
High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling
Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling
Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing
15. 16
Merchandising - Product Page
Product
images
General
Information
List of
Variants
External
Information
Localized
Description
16. 17
> db.item.findOne()
{ _id: "301671", // main item id
department: "Shoes",
category: "Shoes/Women/Pumps",
brand: "Guess",
thumbnail: "http://cdn…/pump.jpg",
image: "http://cdn…/pump1.jpg", // larger version of thumbnail
title: "Evening Platform Pumps",
description: "Those evening platform pumps put the perfect
finishing touches on your most glamourous night-on-the-town
outfit",
shortDescription: "Evening Platform Pumps",
style: "Designer",
type: "Platform",
rating: 4.5, // user rating
lastUpdated: Date("2014/04/01"), // last update time
… }
Merchandising - Item Model
17. 18
• Get item by id
db.definition.findOne( { _id: "301671" } )
• Get items by product ids
db.definition.find( { _id: { $in: ["301671", "301672" ] } } )
• Get items by department
db.definition.find({ department: "Shoes" })
• Get items by category prefix
db.definition.find( { category: /^Shoes\/Women/ } )
• Indices
productId, department, category, lastUpdated
Merchandising - Item Definition
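When the category prefix comes from user input or configuration, the regular expression has to be built at run time. The sketch below shows one way to do that safely; the helper names `escapeRegExp` and `categoryPrefixFilter` are hypothetical, and the returned object is only the filter that would be passed to `find`. Anchoring the pattern with `^` is what generally allows an index on `category` to serve a prefix query.

```javascript
// Escape regex metacharacters so a category path is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter matching every category that starts with `prefix`.
function categoryPrefixFilter(prefix) {
  return { category: new RegExp("^" + escapeRegExp(prefix)) };
}

const prefixFilter = categoryPrefixFilter("Shoes/Women");
console.log(prefixFilter.category.test("Shoes/Women/Pumps")); // true
console.log(prefixFilter.category.test("Men/Shoes/Women"));   // false
```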
18. 19
> db.variant.findOne()
{
_id: "730223104376", // the sku
itemId: "301671", // references item id
thumbnail: "http://cdn…/pump-red.jpg", // variant specific
image: "http://cdn…/pump-red.jpg",
size: 6.0,
color: "Red",
width: "B",
heelHeight: 5.0,
lastUpdated: Date("2014/04/01"), // last update time
…
}
Merchandising – Variant Model
19. 20
• Get variant from SKU
db.variant.find( { _id: "730223104376" } )
• Get all variants for a product, sorted by SKU
db.variant.find( { itemId: "301671" } ).sort( { _id: 1 } )
• Indices
itemId, lastUpdated
Merchandising – Variant Model
20. 22
Per store Pricing could result in billions of documents,
unless you build it in a modular way
Price: {
_id: "sku730223104376_store123",
currency: "USD",
price: 89.95,
lastUpdated: Date("2014/04/01"), // last update time
…
}
_id: concatenation of item and store.
Item: can be an item id or sku
Store: can be a store group or store id.
Indices: lastUpdated
Merchandising – per store Pricing
21. 23
• Get all prices for a given item
db.prices.find( { _id: /^p301671_/ } )
• Get all prices for a given sku (price could be at item level)
db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )
• Get minimum and maximum prices for a sku
db.prices.aggregate( [ { $match: { … } }, { $group: { _id: 1, min: { $min: "$price" },
max: { $max: "$price" } } } ] )
• Get price for a sku and store id (returns up to 4 prices)
db.prices.find( { _id: { $in: [ "sku730223104376_store1234",
"sku730223104376_sgroup0",
"p301671_store1234",
"p301671_sgroup0" ] } }, { price: 1 } )
Merchandising – per store Pricing
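The lookup for a sku and store can return up to four price documents, so the application has to decide which one applies. The sketch below builds the four candidate `_id` values using the naming pattern from the examples on this slide and then picks the most specific match. The precedence order (sku before item, store before store group) and the function names `priceIds` and `pickPrice` are assumptions for illustration; the slides do not state how ties are resolved.

```javascript
// Build the candidate _ids, most specific first. The "sku"/"p" prefixes and
// "_store"/"_sgroup" suffixes follow the sample _ids shown on the slide.
function priceIds(sku, itemId, storeId, storeGroup) {
  return [
    "sku" + sku + "_store" + storeId,
    "sku" + sku + "_sgroup" + storeGroup,
    "p" + itemId + "_store" + storeId,
    "p" + itemId + "_sgroup" + storeGroup,
  ];
}

// Pick the most specific price among the documents the $in query returned.
// Assumed precedence: the order of `ids`.
function pickPrice(ids, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  for (const id of ids) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const ids = priceIds("730223104376", "301671", "1234", "0");
const priceFilter = { _id: { $in: ids } }; // would be passed to db.prices.find(...)

// Stand-in for the query result: only two of the four documents exist.
const found = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];
console.log(pickPrice(ids, found).price); // 89.95
```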
22. 26
Merchandising – Browse and Search products
Browse by
category
Special
Lists
Filter by
attributes
Lists hundreds
of item
summaries
Ideally a single query is issued to the database
to obtain all items and metadata to display
23. 27
The previous page presents many challenges:
• Response within milliseconds for hundreds of items
• Faceted search on many attributes: category, brand, …
• Attributes at the variant level: color, size, etc., and the
variation's image should be shown
• Thousands of variants for an item, need to de-duplicate
• Efficient sorting on several attributes: price, popularity
• Pagination feature which requires deterministic ordering
Merchandising – Browse and Search products
24. 28
Merchandising – Browse and Search products
Hundreds
of sizes
One Item
Dozens of
colors
A single item may have thousands of variants
25. 29
Merchandising – Browse and Search products
Images of the matching
variants are displayed
Hierarchy
Sort
parameter
Faceted
Search
26. 30
Merchandising – Traditional Architecture
Relational DB
System of Records
Full Text Search
Engine
Indexing
#1 obtain
search
results IDs
Application
Cache
#2 obtain
objects by
ID
Pre-joined
into objects
27. 31
The traditional architecture issues:
• 3 different systems to maintain: RDBMS, Search
engine, Caching layer
• search returns a list of IDs to be looked up in the cache,
increases latency of response
• RDBMS schema is complex and static
• The search index is expensive to update
• Setup does not allow efficient pagination
Merchandising – Traditional Architecture
29. 33
The summary relies on the following parameters:
• department e.g. "Shoes"
• An indexed attribute
– Category path, e.g. "Shoes/Women/Pumps"
– Price range
– List of Item Attributes, e.g. Brand = Guess
– List of Variant Attributes, e.g. Color = red
• A non-indexed attribute
– List of Item Secondary Attributes, e.g. Style = Designer
– List of Variant Secondary Attributes, e.g. heel height = 4.0
• Sorting, e.g. Price Low to High
Merchandising – Summary Model
31. 35
• Get summary from item id
db.variation.find({ _id: "p301671" })
• Get summary's specific variation from SKU
db.variation.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
• Get summary by department, sorted by rating
db.variation.find( { department: "Shoes" } ).sort( { rating: 1 } )
• Get summary with mix of parameters
db.variation.find( { department : "Shoes" ,
"vars.attrs" : { "color" : "Gray"} ,
"category" : /^Shoes\/Women/ ,
"price" : { "$gte" : 65.99 , "$lte" : 180.99 } } )
Merchandising - Summary Model
32. 36
Merchandising – Summary Model
• The following indices are used:
– department + attr + category + _id
– department + vars.attrs + category + _id
– department + category + _id
– department + price + _id
– department + rating + _id
• _id used for pagination
• Can take advantage of index intersection
• With several attributes specified (e.g. color=red
and size=6), which one is looked up?
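Using `_id` as the last index field gives a deterministic order, which makes range-based pagination possible. The sketch below builds the filter for the next page from the last document of the previous page, assuming an ascending sort on `{ <sortField>: 1, _id: 1 }`. The function name `nextPageFilter` is hypothetical, and the sketch assumes the base filter does not already contain its own `$or` clause.

```javascript
// Build the filter for the page that follows `last`, the final document of
// the previous page, when sorting by { [sortField]: 1, _id: 1 }.
function nextPageFilter(baseFilter, sortField, last) {
  return {
    ...baseFilter,
    $or: [
      // Strictly after the last sort value...
      { [sortField]: { $gt: last[sortField] } },
      // ...or same sort value, with _id breaking the tie.
      { [sortField]: last[sortField], _id: { $gt: last._id } },
    ],
  };
}

const lastSeen = { _id: "p301671", price: 89.95 };
const pageFilter = nextPageFilter({ department: "Shoes" }, "price", lastSeen);
console.log(JSON.stringify(pageFilter));
```

The resulting filter would be combined with the same sort and a `limit` for the page size, so each page costs roughly the same regardless of how deep the user has scrolled.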
33. 37
Facet samples:
{ "_id" : "Accessory Type=Hosiery" , "count" : 14}
{ "_id" : "Ladder Material=Steel" , "count" : 2}
{ "_id" : "Gold Karat=14k" , "count" : 10138}
{ "_id" : "Stone Color=Clear" , "count" : 1648}
{ "_id" : "Metal=White gold" , "count" : 10852}
Single operations to insert / update:
db.facet.update( { _id: "Accessory Type=Hosiery" },
{ $inc: { count: 1 } }, true, false)
The facet with lowest count is the most restrictive…
It should come first in the query!
Merchandising – Facet
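The "lowest count first" rule can be sketched as a small ordering step that runs before the query is built. The counts below are copied from the facet samples on this slide; the function name `orderFacets` and the choice to place unknown facets last are assumptions for illustration.

```javascript
// Facet counts as they might be read from the facet collection.
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);

// Look up a facet's count; unknown facets are assumed to be least restrictive.
function countOf(counts, facet) {
  return counts.has(facet) ? counts.get(facet) : Number.MAX_SAFE_INTEGER;
}

// Order the selected facets from most to least restrictive (lowest count first).
function orderFacets(selected, counts) {
  return [...selected].sort((a, b) => countOf(counts, a) - countOf(counts, b));
}

const ordered = orderFacets(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"],
  facetCounts
);
console.log(ordered);
// [ 'Stone Color=Clear', 'Gold Karat=14k', 'Metal=White gold' ]
```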
36. 42
Inventory – Traditional Architecture
Relational DB
System of Records
Nightly
Batches
Analytics,
Aggregations,
Reports
Caching
Layer
Field Inventory
Internal &
External Apps
Point-in-time
Loads
38. 44
Inventory – Principles
• Single view of the inventory
• Used by most services and channels
• Read dominated workload
• Local, real-time writes
• Bulk writes for refresh
• Geographically distributed
• Horizontally scalable
39. 45
Inventory – Requirements
Requirement | Challenge | MongoDB
Single view of inventory | Ensure availability of inventory information on all channels and services | Developer-friendly, document-oriented storage
High volume, low latency reads | Anytime, anywhere access to inventory data without overloading the system of record | Fast, indexed reads; local reads; horizontal scaling
Bulk updates, intra-day deltas | Provide window-in-time consistency for highly available services | Bulk writes; fast, in-place updates; horizontal scaling
Rapid application development cycles | Deliver new services rapidly to capture new opportunities | Flexible schema; rich query language; agile-friendly iterations
40. 46
Inventory – Target Architecture
Relational DB
System of Records
Analytics,
Aggregations,
Reports
Field Inventory
Internal &
External Apps
Inventory
Assortments
Shipments
Audits
Products
Stores
Point-in-time
Loads
Nightly
Refresh
Real-time
Updates
44. 50
Stores – Sample Queries
• Get a store by storeId
db.stores.find({ "storeId" : "store0" })
• Get a store by zip code
db.stores.find({ "address.zip" : "12345" })
50. 56
Inventory – Sample Queries
• Get all items in a store
db.inventory.find({ storeId : "store100" })
• Get quantity for an item at a store
db.inventory.find({
"storeId" : "store100",
"productId" : "p200"
})
51. 57
Inventory – Sample Queries
• Get quantity for a sku at a store
db.inventory.find(
{
"storeId" : "store100",
"productId" : "p200",
"vars.sku" : "sku11736"
},
{ "vars.$" : 1 }
)
52. 58
Inventory – Sample Update
• Increment / decrement inventory for an item at
a store
db.inventory.update(
{
"storeId" : "store100",
"productId" : "p200",
"vars.sku" : "sku11736"
},
{ "$inc" : { "vars.$.q" : 20 } }
)
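A decrement usually needs a guard so that stock cannot go negative. One common way, sketched below, is to add the available quantity to the filter with `$elemMatch`, so the update only matches when enough stock remains. The field names `vars`, `sku` and `q` come from the queries above; the function name `decrementSpec` is hypothetical, and the sketch only builds the filter and update documents rather than calling the database.

```javascript
// Build the filter and update documents for a guarded decrement.
function decrementSpec(storeId, productId, sku, qty) {
  return {
    filter: {
      storeId: storeId,
      productId: productId,
      // Match the variant only if at least `qty` units are in stock.
      vars: { $elemMatch: { sku: sku, q: { $gte: qty } } },
    },
    // The positional "$" targets the array element matched by the filter.
    update: { $inc: { "vars.$.q": -qty } },
  };
}

const spec = decrementSpec("store100", "p200", "sku11736", 2);
console.log(JSON.stringify(spec, null, 2));
// In the shell this would be used as:
//   db.inventory.update(spec.filter, spec.update)
```

If no document matches, nothing is modified, and the caller can treat that result as "insufficient stock" without a separate read.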
67. 88
Many user activities can be of interest:
• Search
• Product view, like or wish
• Shopping cart add / remove
• Sharing on social network
• Ad impression, Clickstream
Activity Logging – Data of interest
68. 89
Will be used to compute:
• Product Map (relationships, etc)
• User Preferences
• Recommendations
• Trends …
Activity Logging – Data of interest
69. 90
Activity logging - Architecture
MongoDB
HVDF
API
Activity Logging
User History
External
Analytics:
Hadoop,
Spark,
Storm,
…
User Preferences
Recommendations
Trends
Product Map
Apps
Internal
Analytics:
Aggregation,
MR
All user activity
is recorded
MongoDB –
Hadoop
Connector
Personalization
71. 92
• store and manage an incoming stream of data
samples
– High arrival rate of data from many sources
– Variable schema of arriving data
– control retention period of data
• compute derivative data sets based on these
samples
– Aggregations and statistics based on data
– Roll-up data into pre-computed reports and summaries
• low latency access to up-to-date data (user history)
– Flexible indexing of raw and derived data sets
– Rich querying based on time + meta-data fields in samples
Activity Logging – Problem statement
72. 93
Activity logging - Requirements
Requirement | MongoDB
Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.
Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.
Fast querying on varied fields, sorting | Secondary B-tree indexes can look up and sort the data in milliseconds.
Easy clean up of old data | Deletes are typically as expensive as inserts. Time partitioning provides free deletes.
73. 94
Activity Logging using HVDF
HVDF (High Volume Data Feed):
• Open source reference implementation of high
volume writing with MongoDB
https://github.com/10gen-labs/hvdf
• REST API server written in Java with most
popular libraries
• Public project, issues can be logged
https://jira.mongodb.org/browse/HVDF
• Can be run as-is, or customized as needed
74. 95
Feed
High volume data feed architecture
Channel
Sample Sample Sample Sample
Source
Source
Processor
Inline
Processing
Batch
Processing
Stream
Processing
Grouping by Feed
and Channel
Sources send
samples
Processors generate
derivative Channels
75. 96
HVDF -- High Volume Data Feed engine
HVDF – Reference implementation
REST
Service API
Processor
Plugins
Inline
Batch
Stream
Channel Data Storage
Raw
Channel
Data
Aggregated
Rollup T1
Aggregated
Rollup T2
Query Processor Streaming spout
Custom Stream
Processing Logic
Incoming Sample Stream
POST /feed/channel/data
GET /feed/channeldata?time=XXX&range=YYY
Real-time Queries
77. 98
Dynamic schema for sample data
Sample 1
{
deviceId: XXXX,
time: Date(…),
type: "VIEW",
…
}
Channel
Sample 2
{
deviceId: XXXX,
time: Date(…),
type: "CART_ADD",
cartId: 123, …
}
Sample 3
{
deviceId: XXXX,
time: Date(…),
type: "FB_LIKE"
}
Each sample
can have
variable fields
78. 99
Channels are sharded
Shard
Shard
Shard
Shard
Shard
Shard Key:
Customer_id
Sample
{
customer_id: XXXX,
time: Date(…),
type: "VIEW",
}
Channel
You choose how
to partition
samples
Samples can
have dynamic
schema
Scale
horizontally by
adding shards
Each shard is
highly available
79. 100
Channels are time partitioned
Channel
Sample Sample Sample Sample Sample Sample Sample Sample
- 2 days - 1 Day Today
Partitioning
keeps indexes
manageable
This is where all
of the writes
happen
Older partitions
are read only for
best possible
concurrency
Queries are routed
only to needed
partitions
Partition 1 Partition 2 Partition N
Each partition is
a separate
collection
Efficient and
space reclaiming
purging of old
data
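The partitioning scheme described on this slide can be sketched with a few helper functions: one maps a timestamp to its partition collection, one lists the partitions a time-range query must visit, and one lists the partitions that have aged out and could be dropped. The daily granularity matches the diagram, but the collection naming (`<channel>_YYYYMMDD`) and all function names are assumptions for illustration and are not taken from HVDF itself.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Name of the collection holding samples for the UTC day containing `date`.
function partitionName(channel, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, "");
  return channel + "_" + day;
}

// Collections that a query over [from, to] has to visit.
function partitionsForRange(channel, from, to) {
  const names = [];
  const start = Math.floor(from.getTime() / DAY_MS) * DAY_MS; // UTC midnight
  for (let t = start; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

// Partitions older than the retention period; dropping a whole collection
// reclaims space without deleting documents one by one.
function expiredPartitions(existing, channel, now, retentionDays) {
  const cutoff = partitionName(
    channel,
    new Date(now.getTime() - retentionDays * DAY_MS)
  );
  return existing.filter((name) => name < cutoff);
}

const from = new Date("2014-04-01T22:00:00Z");
const to = new Date("2014-04-03T01:00:00Z");
const visited = partitionsForRange("activity", from, to);
console.log(visited);
// [ 'activity_20140401', 'activity_20140402', 'activity_20140403' ]
```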
80. 101
Dynamic queries on Channels
Channel
Sample Sample Sample Sample
App
App
App
Indexes
Queries Pipelines Map-Reduce
Create custom
indexes on
Channels
Use the full MongoDB
query language to
access samples
Use MongoDB
aggregation
pipelines to access
samples
Use MongoDB
inline map-reduce
to access samples
Full access to
field, text, and geo
indexing
81. 102
North America - West
North America - East
Europe
Geographically distributed system
Channel
Sample Sample Sample Sample
Source
Source
Source
Source
Source
Source
Sample
Sample
Sample
Sample
Geo shards per
location
Clients write to
local nodes
Single view of
channel available
globally
83. 104
Insight – Useful Data
Useful data for better shopping:
• User history (e.g. recently seen products)
• User statistics (e.g. total purchases, visits)
• User interests (e.g. likes videogames and SciFi)
• User social network
84. 105
Insight – Useful Data
Useful data for selling more:
• Cross-selling: people who bought this item tended
to buy these other items too (e.g. bought an iPhone,
then bought an iPhone case)
• Up-selling: people who looked at this item
eventually bought these other items (alternative products
that may be better)
85. 106
• Get the recent activity for a user, to populate the "recently
viewed" list
db.activities.find({ userId: "u123", time: { $gt: DATE }}).
sort({ time: -1 }).limit(100)
• Get the recent activity for a product, to populate the "N users
bought this in the past N hours" list
db.activities.find({ itemId: "301671", time: { $gt: DATE }}).
sort({ time: -1 }).limit(100)
• Indices: time, userId + time, deviceId + time, itemId + time
• All queries should be time bound, since this is a lot of data!
Insight – User History
86. 107
• Get the recent number of views, purchases, etc for a user
db.activities.aggregate([
{ $match: { userId: "u123", time: { $gt: DATE } }},
{ $group: { _id: "$type", count: {$sum: 1} } }])
• Get the total recent sales for a user
db.activities.aggregate([
{ $match: { userId: "u123", time: { $gt: DATE }, type: "ORDER" }},
{ $group: { _id: "result", count: {$sum: "$totalPrice"} } }])
• Get the recent number of views, purchases, etc for an item
db.activities.aggregate([
{ $match: { itemId: "301671", time: { $gt: DATE } }},
{ $group: { _id: "$type", count: {$sum: 1} } }])
• Those aggregations are very fast, real-time
Insight – User Stats
87. 108
• Number of activities for unique visitors for the past hour. Calculation of
uniques is hard for any system!
db.activities.aggregate([
{ $match: { time: { $gt: NOW-1H } }},
{ $group: { _id: "$userId", count: {$sum: 1} } }], { allowDiskUse: true })
• The aggregation above can have issues (single shard final grouping, result
not persisted). Map Reduce is a better alternative here
var map = function() { emit(this.userId, 1); }
var reduce = function(key, values) { return Array.sum(values); }
db.activities.mapReduce(map, reduce,
{ query: { time: { $gt: NOW-1H } },
out: { replace: "lastHourUniques", sharded: true } })
db.lastHourUniques.find({ _id: "u123" }) // number of activities for a user
db.lastHourUniques.count() // total uniques
Insight – User Stats
89. 110
Let's simplify each activity recorded as the following:
{ userId: "u123", type: "order", itemId: 2, time: DATE }
{ userId: "u123", type: "order", itemId: 3, time: DATE }
{ userId: "u234", type: "order", itemId: 7, time: DATE }
Calculate items bought by a user with Map Reduce:
- Match activities of type "order" for the past 2 weeks
- map: emit the document by userId
- reduce: push all itemId in a list
- Output looks like { _id: "u123", items: [2, 3, 8] }
User Activity – Items bought together
90. 111
Then run a 2nd mapreduce job from the previous output to compute
the number of occurrences of each item combination:
- query: go over all documents (1 document per userId)
- map: emit every combination of 2 items, starting with lowest
itemId
- reduce: sum up the total.
- output looks like { _id: { a: 2, b: 3 } , count: 36 }
User Activity – Items bought together
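The two map-reduce passes can be followed more easily with an in-memory sketch of the same logic. It runs on a plain array instead of a collection, skips the two-week time filter, and de-duplicates each user's items with a `Set`; those simplifications, the sample data, and the function names `itemsByUser` and `pairCounts` are assumptions for illustration.

```javascript
// Simplified activity records, in the shape used on the previous slide.
const activities = [
  { userId: "u123", type: "order", itemId: 2 },
  { userId: "u123", type: "order", itemId: 3 },
  { userId: "u234", type: "order", itemId: 3 },
  { userId: "u234", type: "order", itemId: 2 },
  { userId: "u234", type: "order", itemId: 7 },
  { userId: "u123", type: "view", itemId: 7 },
];

// Pass 1: one entry per user with the distinct items they ordered.
function itemsByUser(acts) {
  const byUser = new Map();
  for (const a of acts) {
    if (a.type !== "order") continue;
    if (!byUser.has(a.userId)) byUser.set(a.userId, new Set());
    byUser.get(a.userId).add(a.itemId);
  }
  return byUser;
}

// Pass 2: count every pair of items, keyed with the lowest itemId first.
function pairCounts(byUser) {
  const counts = new Map();
  for (const itemSet of byUser.values()) {
    const items = [...itemSet].sort((x, y) => x - y);
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        const key = items[i] + ":" + items[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const pairs = pairCounts(itemsByUser(activities));
console.log(pairs.get("2:3")); // 2
console.log(pairs.get("3:7")); // 1
```

Ordering each pair with the lowest itemId first means (2, 3) and (3, 2) land on the same key, which is why the slide asks for it in the map step.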
91. 112
Then obtain the most popular combinations per item:
- Indexes created on { "_id.a": 1, count: 1 } and { "_id.b": 1, count: 1 }
- Query with a threshold:
- db.combinations.find( { "_id.a": 2, count: { $gt: 10 }} ).sort({ count: -1 })
- db.combinations.find( { "_id.b": 2, count: { $gt: 10 }} ).sort({ count: -1 })
Later we can create a more compact recommendation collection
that includes popular combinations with weights, like:
{ itemId: 2, recom: [ { itemId: 32, weight: 36},
{ itemId: 158, weight: 23}, … ] }
User Activity – Items bought together
92. 113
User Activity – Hadoop integration
EDW
Management & Monitoring
Security & Auditing
RDBMS
CRM, ERP, Collaboration, Mobile, BI
OS & Virtualization, Compute, Storage, Network
RDBMS
Applications
Infrastructure
Data Management
Operational Analytical
93. 114
Commerce
Applications
powered by
Analysis
powered by
• Products & Inventory
• Recommended products
• Customer profile
• Session management
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
MongoDB
Connector for
Hadoop
95. 116
Connector Features and
Functionality
• Open-source on github
https://github.com/mongodb/mongo-hadoop
• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
– MongoDB as a standard data source/destination
• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections
97. 118
Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;
98. 119
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or
BSONStorageHandler
How does eventual consistency fit into the idea of inventory? Is something in stock or out of stock?
Items on hand matters at order time. What about at buying time?
Are we pitching this as the system of record for inventory or as a single view on top of multiple, discrete inventory systems?
In a single view sort of application, where we’re designing for many use-cases instead of a single application, how do we handle schema design trade-offs?
Challenges for every service/component:
Schema
Indexing
Sharding
Most important criteria:
User facing latency
Linear scaling of services
Not shown on this slide:
Audit collection
Assortments – list of items in an order that a shop is going to make (backorder?)
Shipments – going to stores
one sku per item
fast reading / writing to support updating the inventory in real time
Make this like slide 30, drop the fields, just show the collection relations