Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay



This session will be a case study of eBay’s experience running MongoDB for project Zoom, in which eBay stores all media metadata for the site. This includes references to pictures of every item for sale on eBay. This cluster is eBay's first MongoDB installation on the platform and is a mission critical application. Yuri Finkelstein, an Enterprise Architect on the team, will provide a technical overview of the project and its underlying architecture.

  • Show app servers and mongos on them
  • Fix md5->new document ID
  • 3 shards, 20 writers

    1. Yuri Finkelstein, Lead Platform Services Architect, yfinkelstein@ebay.com
       John Feibusch, Lead DBA Engineer, jfeibusc@ebay.com
       May 2013
    2. About eBay Platform Services
       Platform Services is an org within the larger eBay Platform org. It is responsible for developing and operating common services that are used by Web applications running on the eBay Platform:
       • Media Storage platform services: image blob and metadata
       • Unified Monitoring platform: logs and metrics
       • User Behavior Tracking
       • Ad Content management and analytics
       • Messaging and other middleware services
    3. Platform Services and Media Metadata Service Requirements
       Platform Services is a DevOps organization
       • We develop, we test, we deploy, we operate, we monitor
       • Whatever we are responsible for, we own and understand at the depth of the entire stack
       • Therefore, we require transparency of the components we build on
       • Transparency at the level of source code visibility is ideal
    4. Key Requirements
       Key requirements of the Media Metadata Service
       • 99.999% availability
       • Strictly defined invocation latency at the 95th percentile
       • Simultaneous operation in multiple data centers with short replication latency
       • Reliable writes: synchronous writes to at least 2 nodes
       • Read-write workload with a read/write ratio of about 10:1
       • Agility, fluid metadata content; constantly changing business requirements
       • Terabyte scale, billions of small entities to store and query
       • Extreme scalability: the number of pictures on eBay is constantly growing
    5. Enter MongoDB
       • We have been operating MongoDB in this project for over a year now
       • Sharded cluster in 2 data centers
       • Service nodes are built in Java and use Morphia and the Mongo driver
       • MongoS runs on the service nodes
       • In the 1st year we were maturing the cluster for writes only; this year we are taking reads
       • Reads are from the user-facing web applications with strong SLA requirements
       • For reads, the client first sets SlaveOK=true and, if the required document is not found, flips to SlaveOK=false to read from the Primary
       [Diagram: shards by replicas across DC1 and DC2, with service instances in front. Metadata Service node stack: Service Layer, Morphia, Mongo Driver, MongoS. Legend: S = service instance; P = primary mongod; H = hidden member]
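The read path on this slide (try a secondary first, fall back to the primary on a miss) can be sketched as follows. This is a minimal illustration, not the service's actual Java/Morphia code: `read_with_fallback` and `FakeCollection` are hypothetical names, and the two collection handles are assumed to be configured elsewhere with secondary-allowed and primary-only read preferences.

```python
from typing import Any, Optional


def read_with_fallback(secondary: Any, primary: Any, doc_id: Any) -> Optional[dict]:
    """Read by _id from a secondary-allowed handle, then retry on the primary.

    Both handles only need a find_one(filter) method; a driver collection
    object with the appropriate read preference would fit this shape.
    """
    doc = secondary.find_one({"_id": doc_id})
    if doc is None:
        # The secondary may simply be lagging, so ask the primary before giving up.
        doc = primary.find_one({"_id": doc_id})
    return doc


class FakeCollection:
    """In-memory stand-in (hypothetical) so the sketch runs without a cluster."""

    def __init__(self, docs):
        self._docs = {d["_id"]: d for d in docs}

    def find_one(self, filter):
        return self._docs.get(filter["_id"])


primary = FakeCollection([{"_id": 1, "w": 640}, {"_id": 2, "w": 800}])
lagging_secondary = FakeCollection([{"_id": 1, "w": 640}])  # has not replicated _id 2 yet

found = read_with_fallback(lagging_secondary, primary, 2)
missing = read_with_fallback(lagging_secondary, primary, 3)
```

The trade-off is that a genuine miss costs two round trips, which is acceptable when most documents have already replicated.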
    6. Centralized MongoDB configuration store
       • Our MongoDB deployment package is based on a custom-built RPM and contains heavy customization scripts
       • One of them is responsible for fetching the configuration for the node it's running on from a remote configuration repository at start-up time
       • Benefits:
         • Can change MongoDB configuration instantly on an arbitrarily large number of nodes
         • Can change local system settings affecting MongoDB: read-ahead settings on block devices and the IO scheduler
         • Can relocate replica set members across machines (subject to data migration)
         • Consistent inventory tracking, visibility into config settings on any Mongo machine
       [Diagram: primaries (P) fetch their configuration from the Central MongoDB Config Repository at startup time]
    7. Upstart
       • Upstart is a replacement for init.d; developed for Ubuntu, also used in RHEL 6
       • Can automatically start our monitoring agent whenever mongod starts
       • Handles multiple mongod instances well
       • Example: sudo start mongod interface=0
       • Future: Upstart can be controlled by Puppet
    8. Run multiple MongoD instances on the same machine
       • Starting to run multiple mongod processes on one node
       • Instead of using multiple ports we create multiple virtual interfaces on a single host and register them in DNS as if they were real IP addresses
       • MongoD supports bind_ip, which makes it possible to bind to a specific virtual interface
       • Why virtual interfaces?
         • So that DB hosts can be moved with just a DNS change
       • Why do we want to run multiple MongoD on a single host?
         • On large machines with lots of disk IO and storage capacity, mongod cannot utilize all IO resources
         • Running multiple shards on the same machine reduces data granularity and reduces the scope of each write lock
         • This works well only when multiple MongoD on the same machine have a similar workload
    9. Home-grown MongoDB monitoring system
       • Home-grown agent runs on each MongoDB host and collects very specific metrics that are not available in MMS:
         • Per block-device disk write latency and disk IOPS
         • Details of per-collection MongoDB metrics
       • Can overlay multiple graphs from RS members on the same chart
       • GLE latency is very important since we are doing getLastError({w:2})
    10. Media Metadata Service: Data Model
       • 2 main collections: Item and Image
         • Item references multiple Images
       • Item represents an eBay Item:
         • _id in Item is the external ID of the item in the eBay site DB
         • These IDs are already sharded in a balanced way across N logical DB hosts using ID ranges
         • We use MongoDB pre-split points for the initial mapping of our N site DB shards to M MongoDB shards
         • This ensures good balance between the shards
       • Image represents a picture attached to an Item
         • _id in Image is based on a modified ObjectID of Mongo
         • This ensures good distribution across any number of shards
       • Our choice of document IDs in both collections ensures good balance across Mongo shards
    11. Problem #1: What should be the ID for the documents?
       • ObjectId is not a good shard key for a sharded collection, as the timestamp occupies the first 4 bytes
       • Problem: how should the app generate the ID when this is required?
       • Requirements:
         • Even distribution across shards, both long term and short term
         • Localization of the placement of the indexed _id values in the B-Tree: minimize the chance of a page fault on the index page and increase the chance of co-location of the dirty pages in the page cache, to reduce the amount of random IO when flushing pages to disk
         • Compactness in size is always good to preserve space
       • One possible solution: a 6-byte ID in the following order
         • 1 byte: rotating sequence ID incremented by each writer on every document
         • 1 byte: writer ID; assuming number of writers < 256
         • 4 bytes: timestamp in seconds
       • Works with the limitation that each writer cannot insert more than 256 documents per second
       [Diagram: byte layouts as drawn on the slide]
         MongoDB ObjectId(): Timestamp (4) | MachineID (4) | SequenceNo (4)
         Shard-friendly ID:  SequenceNo (1) | WriterID (1) | Timestamp (4)
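The 6-byte layout above (sequence, writer ID, timestamp) can be sketched as a small generator. This is an illustration under stated assumptions rather than eBay's implementation: the class name `ShardFriendlyIdGenerator` and the big-endian byte order are choices made here, and the sketch does not enforce the 256-documents-per-second limit the slide mentions.

```python
import struct
import time
from typing import Optional


class ShardFriendlyIdGenerator:
    """Builds 6-byte IDs laid out as: sequence (1) | writer ID (1) | timestamp (4)."""

    def __init__(self, writer_id: int):
        if not 0 <= writer_id < 256:
            raise ValueError("writer_id must fit in one byte")
        self._writer_id = writer_id
        self._seq = 0

    def next_id(self, now: Optional[float] = None) -> bytes:
        # Timestamp in whole seconds, truncated to 32 bits.
        ts = int(time.time() if now is None else now) & 0xFFFFFFFF
        seq = self._seq
        # Rotating one-byte sequence: wraps from 255 back to 0.
        self._seq = (self._seq + 1) % 256
        # ">BBI" = big-endian, two unsigned bytes, one unsigned 32-bit int, no padding.
        return struct.pack(">BBI", seq, self._writer_id, ts)


gen = ShardFriendlyIdGenerator(writer_id=7)
first = gen.next_id(now=1367366400)
second = gen.next_id(now=1367366400)
```

Because the leading byte rotates on every insert, consecutive documents from one writer land in different key ranges, while IDs sharing a sequence value and writer stay ordered by time.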
    12. Shard-Friendly ID details
       [Figure: 6-byte ID value plotted against time (N-th minute, (N+1)-th minute), with bands for sequence values Seq=0 through Seq=255; 20 contiguous ranges for each sequence]
       • Let's say we have 20 writers and 3 shards
       • Number of contiguous intervals in each shard: 256 / 3 * 20 ≈ 1,700
       • Worst-case scenario: each contiguous range requires a separate IO. At 200 IOPS: ~8.5 sec to flush it
       • In reality it's much better because of 4 KB pages
       • Rate of writes: 256 docs/sec
       • Number of dirty locations over 1 minute: 256 * 60 * 20 = 307,200
       • So, if _id were md5 or some other random value generator with near-perfect distribution, this would require roughly 180 times more IOPS
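The estimate on this slide can be checked by evaluating its own expressions with the stated inputs (256 sequence values, 20 writers, 3 shards, 200 IOPS, 256 docs/sec per writer). The function name `flush_estimate` is hypothetical, and the model is the slide's worst case of one IO per contiguous interval.

```python
def flush_estimate(seq_values: int = 256, writers: int = 20, shards: int = 3,
                   disk_iops: int = 200, docs_per_sec: int = 256,
                   window_sec: int = 60) -> dict:
    """Evaluate the slide's worst-case arithmetic for one flush window."""
    # Each (sequence value, writer) pair is one contiguous key range; split across shards.
    intervals_per_shard = seq_values / shards * writers
    # Worst case: one IO per contiguous interval.
    flush_seconds = intervals_per_shard / disk_iops
    # With a random _id (e.g. md5), every write in the window dirties its own location.
    random_dirty_locations = docs_per_sec * window_sec * writers
    return {
        "intervals_per_shard": intervals_per_shard,
        "flush_seconds": flush_seconds,
        "random_dirty_locations": random_dirty_locations,
        "random_vs_friendly_ratio": random_dirty_locations / intervals_per_shard,
    }


estimate = flush_estimate()
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the expression 256 / 3 * 20 evaluates to about 1,707 intervals per shard, which at 200 IOPS is about 8.5 seconds, and the random-ID case comes out at 180 times as many dirty locations.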
    13. Problem #2: md5 lookup problem
       • Md5 is a digest of the image content; used for de-dupe
       • Requirement: find image documents with a given md5 value
       • Option 1: secondary index on the image documents; does not work because:
         • Large DB, random reads cause disk IO
         • The image collection is sharded by image ID; forced to query all shards
       • Option 2: Stand-alone replica set (cache)
         • Works since data is compact and fits in RAM; no disk IO
         • How do we store md5->image IDs in Mongo?
         • Option 2.1: As an array: {_id: Binary(md5), ref: [ref1, ref2, ref3, …]}
           Does not work well, since when refs are added documents will grow and relocate
         • Option 2.2: Single binary packed into an ID: {_id: Binary(md5|ref)}
           Works; lookup is based on prefix search and a covering index
           Query: db.coll.find({_id: {$gt: Binary(md5|0x0000)}}, {_id: 1})
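Option 2.2 packs the digest and the image reference into one key so that all references for a digest sit next to each other in key order. The sketch below imitates that prefix search on a sorted in-memory list instead of a MongoDB index. The 6-byte reference length and the names `pack_key` and `refs_for_md5` are assumptions made for illustration. Keys are kept at a fixed length because MongoDB orders binary values by length before comparing bytes.

```python
import bisect
import hashlib
from typing import List

MD5_LEN = 16
REF_LEN = 6  # assumed size of an image reference, for illustration only


def pack_key(md5_digest: bytes, ref: bytes) -> bytes:
    """Concatenate digest and reference into one fixed-length key (md5|ref)."""
    if len(md5_digest) != MD5_LEN or len(ref) != REF_LEN:
        raise ValueError("unexpected digest or reference length")
    return md5_digest + ref


def refs_for_md5(sorted_keys: List[bytes], md5_digest: bytes) -> List[bytes]:
    """Prefix search: every key between md5|00..00 and md5|ff..ff inclusive."""
    lo = bisect.bisect_left(sorted_keys, md5_digest + b"\x00" * REF_LEN)
    hi = bisect.bisect_right(sorted_keys, md5_digest + b"\xff" * REF_LEN)
    return [key[MD5_LEN:] for key in sorted_keys[lo:hi]]


cat = hashlib.md5(b"cat picture bytes").digest()
dog = hashlib.md5(b"dog picture bytes").digest()
keys = sorted([
    pack_key(cat, b"\x00\x01\x00\x00\x00\x01"),
    pack_key(dog, b"\x00\x02\x00\x00\x00\x02"),
    pack_key(cat, b"\x00\x03\x00\x00\x00\x03"),
])
cat_refs = refs_for_md5(keys, cat)
```

Adding a reference is an insert of a new small document rather than growth of an existing one, which is the property the slide cites against the array layout.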
    14. Problem #3: Item's main picture size lookup
       • Image document has image dimensions: width and height
       • Item document references N pictures; one of them is the main picture
       • Problem: look up the image dimensions of the item's main picture for 50 item documents at once, with an SLA for latency < 20 msec
       • It's a variation of Problem #2, except it's worse because ItemID and image dimensions are in different documents and 50 lookups at once are required
       • Again we need a dedicated replica set
       • Option 1: prefix search with $or and $and
         Document: {_id: Binary(item|WxH)}
         Query: db.coll.find({$or: [{_id: {$gt: Binary(id1|0x0000), $lt: Binary(id1|0xffff)}}, {_id: {$gt: Binary(id2|0x0000), $lt: Binary(id2|0xffff)}}, …]})
       • Option 2: just query by _id
         Document: {_id: item, wh: WxH}
         Query: db.coll.find({_id: {$in: [item1, item2, …]}})
       • Option 3: query by _id but on another compound index: {_id: 1, wh: 1}
         Document: {_id: item, wh: WxH}
         Query: db.coll.find({_id: {$in: [item1, item2, …]}}).hint({_id: 1, wh: 1})
       • Winner is option #3! Hint: covering index
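The winning option can be sketched as the three pieces a driver call needs: an `$in` filter on `_id`, a projection, and an index hint. The function name `main_picture_size_query` is hypothetical. The projection is an assumption added here, on the grounds that a query is only answered from the index alone when it returns nothing but indexed fields.

```python
from typing import Any, Dict, List, Sequence, Tuple


def main_picture_size_query(
    item_ids: Sequence[Any],
) -> Tuple[Dict[str, Any], Dict[str, int], List[Tuple[str, int]]]:
    """Build filter, projection, and index hint for a batched size lookup."""
    query_filter = {"_id": {"$in": list(item_ids)}}
    # Assumption: return only fields present in the compound index.
    projection = {"_id": 1, "wh": 1}
    # Same key order as the compound index {_id: 1, wh: 1} named on the slide.
    index_hint = [("_id", 1), ("wh", 1)]
    return query_filter, projection, index_hint


query_filter, projection, index_hint = main_picture_size_query([101, 102, 103])
# With a driver collection handle `coll`, the call would look roughly like:
#   coll.find(query_filter, projection).hint(index_hint)
```

Batching all 50 item IDs into one `$in` keeps the lookup to a single round trip, which matters for a 20 msec budget.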
    15. Problem #4: Periodic export to Hadoop
       • Problem: daily copy of the new or updated documents to Hadoop
       • Option 1: the service does 2 writes: to Mongo and to Hadoop
         • Does not work since Hadoop is not an online system
       • Option 2: secondary index on lastUpdated (date); then query on lastUpdated > T
         • Does not work well, since updating the indexed lastUpdated field is costly; also, consuming a large number of docs from a live cluster is disruptive to latency SLAs
       • Option 3: OpLog replication
         • Winner: decouples export from site activity and makes the lastUpdated index unnecessary
       [Diagram: primaries (P) feeding an OpLog listener]
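The core of an OpLog listener is turning oplog entries into a stream of "this document changed" events for the export job. The sketch below does that over plain dictionaries shaped like oplog entries (`op`, `ns`, `o`, `o2`). The function name `changed_documents` and the namespace `media.image` are hypothetical, and tailing the oplog and resuming from a saved timestamp are left out.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def changed_documents(oplog_entries: Iterable[Dict[str, Any]],
                      namespace: str) -> Iterator[Tuple[str, Any]]:
    """Yield (action, _id) for inserts, updates, and deletes in one namespace."""
    for entry in oplog_entries:
        if entry.get("ns") != namespace:
            continue  # another collection, or a no-op entry
        op = entry.get("op")
        if op == "i":
            yield ("upsert", entry["o"]["_id"])   # "o" holds the inserted document
        elif op == "u":
            yield ("upsert", entry["o2"]["_id"])  # "o2" holds the update's _id selector
        elif op == "d":
            yield ("delete", entry["o"]["_id"])   # "o" holds the deleted _id


sample_oplog = [
    {"op": "i", "ns": "media.image", "o": {"_id": "a1", "w": 640, "h": 480}},
    {"op": "u", "ns": "media.image", "o2": {"_id": "a1"}, "o": {"$set": {"w": 800}}},
    {"op": "i", "ns": "media.item", "o": {"_id": 42}},
    {"op": "n", "ns": "", "o": {"msg": "periodic noop"}},
    {"op": "d", "ns": "media.image", "o": {"_id": "b2"}},
]
events = list(changed_documents(sample_oplog, "media.image"))
```

The export job can then collapse repeated events per `_id` and fetch or delete each document once, without needing a lastUpdated index.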
    16. Problem #5: What's the fastest way to perform a full scan?
       • Problem: you have a huge database/collection, with terabytes of data and billions of documents
       • You need to perform a form of batch processing on all the documents and you want the fastest pipe out of Mongo
       • Option 1: Do it on a live node as it's serving traffic
         • Does not work well when the node is busy
         • Also, data consistency may be an issue
       • OK, need to take the node off-line
       • Option 2: execute a natural-order scan:
         • Natural-order cursor
         • Works, but slow; lots of synchronization between the two sides
         One cursor: db.collection.find().sort({$natural: 1})
       • Option 3: N cursors using a range query on _id or any other indexed field
         • Slow in the general case, when the order of indexed values in the B-Tree and the order on disk do not match
       • Option 4: N natural-order cursors
         N cursors: db.collection.find().sort({$natural: 1}).skip(i*K).limit(K), where i is the cursor number and K is the number of documents per cursor
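Option 4 needs a skip and limit pair for each of the N cursors. A minimal sketch of that partitioning follows, assuming the total document count is known up front and the node is off-line so the natural order is stable during the scan. The function name `cursor_ranges` is hypothetical.

```python
from typing import List, Tuple


def cursor_ranges(total_docs: int, n_cursors: int) -> List[Tuple[int, int]]:
    """Split a natural-order scan into (skip, limit) pairs, one per cursor."""
    if n_cursors <= 0:
        raise ValueError("n_cursors must be positive")
    chunk = -(-total_docs // n_cursors)  # ceiling division
    ranges = []
    for i in range(n_cursors):
        skip = i * chunk
        limit = max(0, min(chunk, total_docs - skip))
        # Drop empty ranges: in MongoDB a limit of 0 means "no limit".
        if limit:
            ranges.append((skip, limit))
    return ranges


ranges = cursor_ranges(total_docs=10, n_cursors=3)
# Each pair would drive one cursor, roughly:
#   db.collection.find().sort({$natural: 1}).skip(skip).limit(limit)
```

Each cursor still has to walk past the documents it skips, so later cursors start more slowly than earlier ones; the sketch only shows how the ranges are derived.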
    17. Summary
       • We are running MongoDB in a demanding environment where it's exposed to business-sensitive online applications
       • It seems to be reliable; this is what matters
       • It has lots of features and gives the user lots of options to choose from
       • It's the user's depth of understanding of the product, and the desire to have visibility into every aspect of its performance, that will determine whether a particular use case will be a success or not
    18. Questions?
       • Thank you!
       • Btw, if any of this sounds interesting, we have lots of similar challenges to work on. So, you know the drill: yfinkelstein at ebay dot com