MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB presented by Yuri Finkelstein, Architect, eBay

Yuri Finkelstein
Lead Platform Services Architect
yfinkelstein@ebay.com
John Feibusch
Lead DBA Engineer
jfeibusc@ebay.com
May 2013

About eBay Platform Services
 Platform Services is an org within the larger eBay Platform
org, responsible for developing and operating common
services used by web applications running on the eBay
platform
• Media Storage platform services: image blob and metadata
• Unified Monitoring platform: logs and metrics
• User Behavior Tracking
• Ad Content management and analytics
• Messaging and other middleware services

Platform Services and Media Metadata Service
Requirements
 Platform Services is a DevOps organization
• We develop, we test, we deploy, we operate, we monitor
• Whatever we are responsible for, we own and understand through
the full depth of the stack
• Therefore, we require transparency of the components we build on
• Transparency at the level of source code visibility is ideal

Key Requirements
 Key requirements of Media Metadata Service
• 99.999% availability
• Strictly defined invocation latency at the 95th percentile
• Simultaneous operation in multiple data centers with short replication
latency
• Reliable writes: synchronous writes to at least 2 nodes
• Read-write workload with a read/write ratio of about 10:1
• Agility: fluid metadata content and constantly changing business
requirements
• Terabyte scale: billions of small entities to store and query
• Extreme scalability: the number of pictures on eBay is constantly growing

Enter MongoDB
 We have been operating MongoDB in this
project for over a year now
 Sharded cluster in 2 data centers
 Service nodes are built in Java and use
Morphia and Mongo driver
 MongoS runs on the service nodes
 In the 1st year we matured the cluster on
writes only; this year we are taking reads
 Reads are from the user-facing web
applications with strong SLA requirements
 For reads, the client first sets SlaveOK=true
and, if the required document is not found, flips
to SlaveOK=false to read from the Primary
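[Diagram: sharded cluster with shards laid out side by side and replica set members spanning DC1 and DC2; each shard has a primary (P) and a hidden member (H). Service instances (S) sit in front of the cluster. Each Metadata Service Node stacks the Service Layer, Morphia, the Mongo Driver and MongoS.]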
S – service instance; P – primary
mongod; H – hidden member
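
The secondary-first read described on this slide can be sketched as follows. This is a minimal illustration, not the service's actual code: `readSecondary` and `readPrimary` are hypothetical stand-ins for driver calls made with SlaveOK=true and SlaveOK=false, and the in-memory maps only simulate a lagging secondary.

```javascript
// Try the secondary first; fall back to the primary only when the
// document is missing (for example because replication has not caught up).
function readWithFallback(id, readSecondary, readPrimary) {
  const fromSecondary = readSecondary(id);
  if (fromSecondary !== null && fromSecondary !== undefined) {
    return { doc: fromSecondary, source: "secondary" };
  }
  const fromPrimary = readPrimary(id);
  return {
    doc: fromPrimary === undefined ? null : fromPrimary,
    source: "primary",
  };
}

// Hypothetical in-memory stand-ins: the secondary has not yet seen "img2".
const primaryData = new Map([["img1", { w: 640 }], ["img2", { w: 800 }]]);
const secondaryData = new Map([["img1", { w: 640 }]]);
const readPrimary = (id) => (primaryData.has(id) ? primaryData.get(id) : null);
const readSecondary = (id) => (secondaryData.has(id) ? secondaryData.get(id) : null);

const hit = readWithFallback("img1", readSecondary, readPrimary);
const lagged = readWithFallback("img2", readSecondary, readPrimary);
const missing = readWithFallback("img3", readSecondary, readPrimary);
```

The cost of this design is a second round trip for documents that genuinely do not exist, in exchange for keeping most reads off the primaries.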

Centralized MongoDB configuration store
 Our MongoDB deployment package is based on a
custom-built RPM and contains extensive customization
scripts
 One of them is responsible for fetching configuration for
the node it’s running on from a remote configuration
repository at start-up time
 Benefits:
• Can change MongoDB configuration instantly on an arbitrarily
large number of nodes
• Can change local system settings affecting MongoDB:
read-ahead settings on block devices and the IO scheduler
• Can relocate replica set members across machines (subject
to data migration)
• Consistent inventory tracking, visibility into config settings
on any Mongo machine
[Diagram: each mongod (P) fetches its configuration from the Central MongoDB Config Repository at startup time]

Upstart
 Upstart is a replacement for init.d; developed for Ubuntu, also used in
RHEL 6
 Can automatically start our monitoring agent whenever mongod starts.
Handles multiple mongod instances well
 Example:
 sudo start mongod interface=0
 Future: Upstart can be controlled by Puppet.

Run multiple MongoD instances on the same machine
 Starting to run multiple mongod processes on one node
 Instead of using multiple ports we create multiple virtual interfaces on a
single host and register them in DNS as if they were real IP addresses
 MongoD supports bind_ip which makes it possible to bind to a specific
virtual interface
 Why virtual interfaces?
 So that DB hosts can be moved with just a DNS change
 Why do we want to run multiple MongoD on a single host?
 On large machines with lots of disk IO and storage capacity, a single mongod cannot
utilize all IO resources
 Running multiple shards on the same machine gives finer data granularity and
narrows the scope of each write lock
 This works well only when the multiple MongoD on the same machine have similar
workloads

Home-grown MongoDB monitoring system
 Home-grown agent runs on
each MongoDB host and
collects very specific metrics
that are not available in
MMS:
• Per-block-device disk write
latency and disk IOPS
• Details of per-collection
MongoDB metrics
 Can overlay multiple graphs
from RS members on the
same chart
 GLE latency is very important
since we are doing
• getLastError({w:2})

Media Metadata Service: Data Model
 2 main collections: Item and Image
• Item references multiple Images
 Item represents eBay Item:
• _id in Item is external ID of the item in eBay site DB
• These IDs are already sharded and balanced across N
logical DB hosts using ID ranges
• We use MongoDB pre-split points for the initial
mapping of our N site DB shards to M MongoDB shards
• This ensures good balance between the shards
 Image represents a picture attached to an
Item
• _id in Image is based on modified ObjectID of Mongo
• This ensures good distribution across any number of
shards
 Our choice of document IDs in both
collections ensures good balance across
Mongo shards
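
One way to derive pre-split points from existing ID ranges is sketched below. The slide does not say how the N site DB ranges are grouped onto the M MongoDB shards; this sketch assumes contiguous, roughly equal groups, and `preSplitPoints` and the sample range starts are hypothetical.

```javascript
// Given the sorted start IDs of N site DB ranges, pick M-1 split points so
// that each MongoDB shard receives a contiguous group of about N/M ranges.
function preSplitPoints(rangeStarts, shardCount) {
  const n = rangeStarts.length;
  const points = [];
  for (let s = 1; s < shardCount; s++) {
    points.push(rangeStarts[Math.floor((s * n) / shardCount)]);
  }
  return points;
}

// Hypothetical example: 12 site DB ranges of 100 IDs each, 3 MongoDB shards.
const rangeStarts = Array.from({ length: 12 }, (_, i) => i * 100);
const splitPoints = preSplitPoints(rangeStarts, 3);
```

Each resulting value would then be used as a split point on `_id` before any data is loaded, so the initial chunks line up with the site DB ranges.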

Problem #1: What should be the ID for the documents?
 ObjectId is not a good shard key for a sharded collection because a
timestamp occupies its first 4 bytes, so new documents all land in the same key range.
 Problem: how should the app generate the ID when this is
required?
 Requirements:
• Even distribution across shards both long term and short
term
• Localization of the placement of the indexed _id values in the
B-Tree: minimize the chance of a page fault on an index page
and increase the chance that dirty pages coalesce in the page
cache, reducing the amount of random IO when flushing pages
to disk
• Compactness in size is always good to preserve space
 One possible solution: 6 byte ID in the following order
• 1 byte – rotating sequence ID incremented by each writer on
every document
• 1 byte – writer ID; assuming number of writers < 256
• 4 byte – timestamp in seconds
 Works with limitation that each writer can not insert more
than 256 documents per second
MongoDB ObjectId():
Timestamp (4 bytes) | MachineID (4 bytes) | SequenceNo (4 bytes)
Shard-Friendly ID:
SequenceNo (1 byte) | WriterID (1 byte) | Timestamp (4 bytes)
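
The 6-byte layout above can be generated as in the following sketch. The byte order of the timestamp (big-endian here) and the `makeIdGenerator` helper are assumptions; the slide specifies only the field order and sizes.

```javascript
// Builds IDs laid out as [sequence (1 byte), writerId (1 byte), timestamp (4 bytes, seconds)].
// The sequence byte leads so that consecutive inserts spread across the key space.
function makeIdGenerator(writerId) {
  if (!Number.isInteger(writerId) || writerId < 0 || writerId > 255) {
    throw new RangeError("writerId must fit in one byte");
  }
  let seq = 0;
  return function nextId(nowMs = Date.now()) {
    const id = Buffer.alloc(6);
    id.writeUInt8(seq, 0);
    id.writeUInt8(writerId, 1);
    // Big-endian seconds since the epoch (assumed byte order).
    id.writeUInt32BE(Math.floor(nowMs / 1000) >>> 0, 2);
    seq = (seq + 1) & 0xff; // rotate through 0..255
    return id;
  };
}

const nextId = makeIdGenerator(7);
const fixedClockMs = 1367366400000; // fixed clock for a reproducible example
const first = nextId(fixedClockMs);
const second = nextId(fixedClockMs);
```

Because the sequence wraps after 256 values, a writer that produced more than 256 IDs within the same second would repeat an ID, which is the limitation noted on the slide.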

Shard-Friendly ID details
[Figure: 6-byte ID value (00… to ff…) against time (N-th minute, N+1-th minute), with one band per sequence value from Seq=0 to Seq=255; 20 contiguous ranges for each sequence]
Let’s say we have 20 writers and 3
shards
Number of contiguous intervals in
each shard:
256/3 * 20 ≈ 1,700
Worst-case scenario: each
contiguous range requires a
separate IO. At 200 IOPS:
~8.5 sec to flush it
In reality it’s much better because
of 4 KB pages
Rate of writes: 256 docs/sec per writer
Number of dirty locations over 1
minute: 256 * 60 * 20 = 307,200
So, if _id were an md5 or the output of some other
random value generator with
~perfect distribution, this would
require roughly 180 times more IOPS
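
The estimate can be recomputed directly from its stated inputs (20 writers, 3 shards, 256 sequence values, 200 IOPS, 256 docs/sec per writer). This is only a sketch of the arithmetic; the variable names are illustrative, and evaluating 256/3 × 20 as written gives about 1,707 intervals.

```javascript
const writers = 20;
const shards = 3;
const sequenceValues = 256;
const iops = 200;
const docsPerSecPerWriter = 256;

// Contiguous ID ranges landing on each shard: one per (sequence value, writer) pair.
const rangesPerShard = Math.round((sequenceValues / shards) * writers);

// Worst case: one IO per contiguous range.
const flushSeconds = rangesPerShard / iops;

// With a random _id, every document written in a minute dirties its own location.
const dirtyLocationsPerMinute = docsPerSecPerWriter * 60 * writers;

// How many more IOs the random-ID case needs than the range-friendly case.
const ioRatio = dirtyLocationsPerMinute / rangesPerShard;
```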

Problem #2: md5 lookup problem
 Md5 is a digest of the image content; used for de-dupe
 Requirement: find image documents with a given
md5 value
 Option 1: secondary index on the image
documents; does not work because:
• Large DB, random reads cause disk IO
• Image collection is sharded by image ID, so we are
forced to query all shards
 Option 2: Stand-alone replica set (cache)
• Works since data is compact and fits in RAM;
no disk IO
• How do we store md5->image IDs in Mongo?
• Option 2.1: As an array
 Does not work well since when refs are added
documents will grow and relocate.
• Option 2.2: Single Binary Packed into an ID
 Works; lookup is based on prefix search and
covering index
Option 2.1 document:
{
_id: Binary(md5),
ref: [ref1, ref2, ref3 …]
}
Option 2.2 document:
{
_id: Binary(md5|ref)
}
Query:
db.coll.find(
{
_id: {$gt: Binary(md5|0x0000), $lt: Binary(md5|0xffff)}
},
{ _id: 1 }
)
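
The packed-key idea of Option 2.2 can be sketched without a database. The 2-byte ref width is an assumption that simply mirrors the 0x0000/0xffff bounds on the slide; `packKey`, `prefixBounds` and `findRefs` are hypothetical helpers, and a sorted array stands in for the `_id` index.

```javascript
// Pack a 16-byte md5 and a 2-byte ref into one binary key: md5|ref.
function packKey(md5Hex, ref) {
  const md5 = Buffer.from(md5Hex, "hex");
  if (md5.length !== 16) throw new RangeError("md5 must be 16 bytes");
  const key = Buffer.alloc(18);
  md5.copy(key, 0);
  key.writeUInt16BE(ref, 16);
  return key;
}

// Bounds for a prefix scan over every key that starts with the given md5.
function prefixBounds(md5Hex) {
  return { lower: packKey(md5Hex, 0x0000), upper: packKey(md5Hex, 0xffff) };
}

// In-memory stand-in for the range query. Both bounds are exclusive,
// so refs 0x0000 and 0xffff themselves are treated as reserved.
function findRefs(sortedKeys, md5Hex) {
  const { lower, upper } = prefixBounds(md5Hex);
  return sortedKeys
    .filter((k) => Buffer.compare(k, lower) > 0 && Buffer.compare(k, upper) < 0)
    .map((k) => k.readUInt16BE(16));
}

// Arbitrary sample digests.
const md5A = "9e107d9d372bb6826bd81d3542a419d6";
const md5B = "d41d8cd98f00b204e9800998ecf8427e";
const keys = [packKey(md5B, 7), packKey(md5A, 2), packKey(md5A, 1)].sort(Buffer.compare);
const refsForA = findRefs(keys, md5A);
```

Since everything needed is in the key itself, the lookup can be answered from the index alone, which is what makes the covering-index approach work.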

Problem #3: Item’s main picture size lookup
 Image document has image dimensions:
width and height
 Item document references N pictures; one of
them is main
 Problem: lookup image dimensions of the
item’s main picture for 50 item documents at
once with SLA for latency < 20 msec
 It’s a variation of Problem #2 except it’s
worse because ItemID and image
dimensions are in different documents and
50 lookups at once are required
 Again we need a dedicated replica set
 Option 1: prefix search with $or and $and
 Option 2: just query by _id
 Option 3: query by id but on another
compound index: {_id:1, wh:1}
 Winner is option #3! Hint: covering index
Option 1 document:
{ _id: Binary(item|WxH) }
Query:
db.coll.find({
$or: [
{_id: {$gt: Binary(id1|0x0000), $lt: Binary(id1|0xffff)}},
{_id: {$gt: Binary(id2|0x0000), $lt: Binary(id2|0xffff)}},
…
]})
Option 2 document:
{ _id: item, wh: WxH }
Query:
db.coll.find(
{ _id: {$in: [item1, item2, …]} })
Option 3 document:
{ _id: item, wh: WxH }
Query:
db.coll.find(
{ _id: {$in: [item1, item2, …]} },
{ _id: 1, wh: 1 })
.hint({_id: 1, wh: 1})

Problem #4: Periodic export to Hadoop
 Problem: daily copy of the new or
updated documents to Hadoop
 Option 1: service does 2 writes: to
mongo and to hadoop
• Does not work since Hadoop is not an
online system
 Option 2: secondary index on
lastUpdated (date); then query on
lastUpdated > T
• Does not work well since updating an indexed
lastUpdated field is costly; also, consuming a
large number of docs from a live cluster is
disruptive to latency SLAs
 Option 3: OpLog replication
• Winner:
 Decouples export from site activity
 Makes the lastUpdated index unnecessary
[Diagram: primaries (P), an OpLog Listener, and an open question ("??") labeled "Problem:"]
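
What an OpLog listener has to extract for a daily export can be sketched as below. This is an illustration only: the namespace `media.Image` and the sample entries are hypothetical, `ts` is simplified to a number, and the code relies on the usual oplog entry shape (`op` of `i`/`u`/`d`, `ns`, `o`, and `o2` on updates).

```javascript
// Collect the _ids of documents inserted or updated in one collection
// from a batch of oplog entries. Deletes and other namespaces are ignored.
function changedIds(oplogEntries, namespace) {
  const ids = new Set();
  for (const entry of oplogEntries) {
    if (entry.ns !== namespace) continue;
    if (entry.op === "i") {
      ids.add(entry.o._id); // insert: the full document is in `o`
    } else if (entry.op === "u") {
      ids.add(entry.o2._id); // update: the target document is identified by `o2`
    }
  }
  return [...ids];
}

// Hypothetical oplog batch.
const sampleOplog = [
  { ts: 1, op: "i", ns: "media.Image", o: { _id: "a", w: 640, h: 480 } },
  { ts: 2, op: "u", ns: "media.Image", o2: { _id: "a" }, o: { $set: { w: 800 } } },
  { ts: 3, op: "i", ns: "media.Item", o: { _id: "x" } },
  { ts: 4, op: "i", ns: "media.Image", o: { _id: "b", w: 100, h: 100 } },
  { ts: 5, op: "d", ns: "media.Image", o: { _id: "c" } },
];
const toExport = changedIds(sampleOplog, "media.Image");
```

Deduplicating by `_id` means a document updated many times in a day is exported once, which keeps the export volume independent of update frequency.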

Problem #5: What’s the fastest way to perform
a full scan?
 Problem: you have a huge database/collection,
with terabytes of data and billions of documents
 You need to perform a form of batch processing
on all the documents and you want the fastest
pipe out of mongo
 Option 1: Do it on a live node as it’s serving traffic
• Does not work well when the node is busy
• Also – data consistency may be an issue
 Ok, need to take the node off-line
 Option 2: execute a natural-order scan:
• Natural order cursor
• Works, but slow; lots of synchronization between the two
sides
 Option 3: N cursors using a range query on _id or
any other indexed field
• Slow in the general case, when the order of indexed values
in the B-Tree and the order on disk do not match
 Option 4: N natural-order cursors
One cursor:
db.collection.find({})
.sort({$natural: 1})
N cursors, each covering a chunk of C documents (cursor i = 0 … N-1):
db.collection.find({})
.sort({$natural: 1})
.skip(i * C)
.limit(C)
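
Splitting the scan across N natural-order cursors needs a skip and limit per cursor. A minimal sketch of that partitioning follows; `scanRanges` is a hypothetical helper and assumes the document count is known up front and the node is off-line, so the count does not change during the scan.

```javascript
// Divide totalDocs into at most `cursors` contiguous (skip, limit) windows.
function scanRanges(totalDocs, cursors) {
  const chunk = Math.ceil(totalDocs / cursors);
  const ranges = [];
  for (let i = 0; i < cursors; i++) {
    const skip = i * chunk;
    if (skip >= totalDocs) break; // fewer documents than cursors
    ranges.push({ skip, limit: Math.min(chunk, totalDocs - skip) });
  }
  return ranges;
}

const ranges = scanRanges(10, 3);
const covered = ranges.reduce((sum, r) => sum + r.limit, 0);
```

Each window would feed one natural-order cursor, using its `skip` and `limit` values in place of the per-cursor offsets.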

Summary
 We are running MongoDB in a demanding environment where it’s
exposed to business sensitive online applications
 It seems to be reliable – this is what matters
 It has lots of features and gives the user lots of options to choose from
 It’s the user’s depth of understanding of the product and desire to
have visibility into every aspect of its performance that will determine
whether a particular use case will be a success or not

Questions?
 Thank you!
 Btw, if any of this sounds interesting, we have lots of
similar challenges to work on. So, you know the drill:
yfinkelstein at ebay dot com
