
An Elastic Metadata Store for eBay’s Media Platform

In order to build robust, multi-tenant, highly available storage services that meet the business's SLA, your databases have to be sharded. But if your service has to scale continuously through incremental additions of storage, without service interruption or human intervention, basic static sharding is not enough. At eBay, we are building MStore to solve this problem, with MongoDB as the storage engine. In this presentation, we dive into the key design concepts of this solution.

  1. Elastic metadata store for eBay media platform
     Yuri Finkelstein, Architect, eBay Global Platform Services
     MongoDB SF 2014
  2. Introduction
     • eBay items have media: pictures, 360 views, overlays, video, etc.
     • Media is binary content plus metadata; the metadata is rich and is best modeled as a document.
     • Traffic is 99% reads, 1% writes, and roughly 1/100th of documents are deleted daily.
     • MongoDB is a reasonable fit, but we need a service on top of it.
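
     To make the document model concrete, here is a minimal sketch of such a
     media metadata document. All field names and values are illustrative
     assumptions, not eBay's actual schema:

        # Illustrative media metadata document (hypothetical schema).
        item_media = {
            "_id": "271828182845:img:0",   # hypothetical composite key: item id + media index
            "itemId": 271828182845,
            "type": "picture",             # picture | 360-view | overlay | video
            "urls": {
                "thumb": "https://i.example.com/t/abc.jpg",
                "full": "https://i.example.com/f/abc.jpg",
            },
            "width": 1600,
            "height": 1200,
            "deleted": False,              # tombstone flag (see the compaction slides)
        }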
  3. What is a data service?
     • A data service, as opposed to a set of database instances, offers:
     • An SLA
     • Automated data lifecycle management instead of reliance on DBA excellence
     • No downtime during hardware repair or maintenance
     • No downtime as data grows and hardware is added
     • Multiple tenants:
       • tenants come and go, and grow (and shrink) at different rates
       • tenants differ in their requirements for cross-DC replication and data access latencies
  4. What is wrong with this picture?
     • Vertical scalability only
     • Prone to coarse-grained outages
     • No service model, no SLA
     • Limited number of connections
     • etc.
     [Diagram: tenants A and B, each with several app instances connecting through drivers directly to a set of DB replicas]
  5. Is MongoDB sharding the answer?
     • It can scale out in theory, but at the time of expansion we either need downtime or are likely to breach the SLA.
     • Mongo chunks are logical: engaging new hardware either causes a flurry of IO or is too slow.
     • There is still no service boundary.
     • The other problems mentioned earlier remain.
     [Diagram: app instances connecting through drivers to MongoS routers in front of sharded replica sets, with a question mark over a newly added shard]
  6. On the effect of chunk migration: a great slide by Kenny Gorman / ObjectRocket
  7. Buckets
     • We need a smaller unit of data granularity: the bucket.
     • ~100 buckets per tenant to begin with.
     • Bucket algebra:
       • create / delete
       • split / merge
       • compact
       • move (to another RS)
     [Diagram: one storage host running four MongoD processes (RS1 through RS4), each holding several buckets]
  8. _id => BucketName?
     • This can be done in a number of ways, depending on the use case (a sketch follows below).
     • If range queries on _id are needed, use an order-preserving partitioner, as in HBase.
     • If access is by _id only, consistent hashing works well, as in Memcached or Cassandra.
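
     For the _id-only case, here is a minimal sketch of hash partitioning onto
     a fixed set of buckets. The partitioner below is an illustration, not
     MStore's actual one; the point is that with a fixed bucket count a key
     never changes buckets, so rebalancing moves whole buckets between
     replica sets instead of moving keys.

        import hashlib

        NUM_BUCKETS = 100   # "~100 buckets per tenant" from slide 7; fixed

        def bucket_for_id(tenant, _id):
            # Hash the _id onto one of NUM_BUCKETS buckets for this tenant.
            digest = hashlib.md5(_id.encode("utf-8")).digest()
            bucket_no = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
            return "%s.b%d" % (tenant, bucket_no)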
  9. MStore DB Proxy
     • Proxies and manages DB connections.
     • Runs on each host; lightweight and efficient.
     • Connects to mongo over a unix socket: BSON in, BSON out.
     • Performs "logical address translation" in BSON messages (sketched below):
       • bucketName => mongoDB dbName
       • the dbName changes after each compaction
     [Diagram: one storage host with the proxy sitting in front of four MongoD processes (RS1 through RS4), each holding several buckets]
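
     A minimal sketch of that translation table. The naming scheme (a
     generation suffix bumped on each compaction) is an assumption; the talk
     only says that the dbName changes per compaction:

        class BucketTranslator:
            def __init__(self):
                # bucket name -> physical db name currently backing it
                self.routes = {"tenantA.b42": "tenantA_b42_g3"}

            def physical_db(self, bucket):
                # Rewrite the logical bucket name found in an incoming
                # BSON request into the physical database name.
                return self.routes[bucket]

            def flip(self, bucket, new_db):
                # Called at the end of compaction: repoint the bucket at
                # the freshly compacted database image.
                self.routes[bucket] = new_db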
  10. MStore Service Tier
     • Stateless REST service exposing a domain API.
     • Route calculation (sketched below):
       • _id => BucketName
       • BucketName => ReplicaSetId
       • isWrite or !staleReadOk? => the primary MongoD; otherwise some secondary MongoD
       • MongoD => host
     • The request then goes to the proxy on that host.
     [Diagram: app instances talk to the mstore service over http/json; the service talks to the storage servers over http/bson]
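
     A sketch of that route calculation, reusing bucket_for_id from the
     slide 8 sketch; the cluster_map layout is the assumed one shown in the
     coordination tier sketch further below:

        import random

        def route(cluster_map, tenant, _id, is_write, stale_read_ok=False):
            bucket = bucket_for_id(tenant, _id)           # _id        => BucketName
            rs = cluster_map["bucket_to_rs"][bucket]      # BucketName => ReplicaSetId
            members = cluster_map["rs_members"][rs]
            if is_write or not stale_read_ok:
                mongod = members["primary"]               # writes and fresh reads hit the primary
            else:
                mongod = random.choice(members["secondaries"])
            return cluster_map["mongod_to_host"][mongod]  # MongoD => host; request goes to proxy@host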
  11. Connections, Protocols, Payload formats
     • MongoDB has a too-many-connections problem.
     • The service forms BSON and sends it to the proxy over HTTP.
     • The proxy needs only a few connections to mongo.
     • Fair request scheduling (sketched below).
     [Diagram: mstore service => BSON over HTTP/1.1 with keep-alive => proxy => native transport over a unix socket => MongoD]
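
     The talk does not detail the scheduling policy, so the following is only
     one plausible reading of "fair request scheduling": round-robin across
     per-tenant queues, so that one chatty tenant cannot monopolize the
     proxy's small pool of MongoDB connections.

        from collections import deque

        class FairScheduler:
            def __init__(self):
                self.order = deque()   # tenants in round-robin order
                self.queues = {}       # tenant -> deque of pending requests

            def submit(self, tenant, request):
                if tenant not in self.queues:
                    self.queues[tenant] = deque()
                    self.order.append(tenant)
                self.queues[tenant].append(request)

            def next_request(self):
                # One full round-robin pass; returns None when idle.
                for _ in range(len(self.order)):
                    tenant = self.order[0]
                    self.order.rotate(-1)   # move this tenant to the back
                    if self.queues[tenant]:
                        return tenant, self.queues[tenant].popleft()
                return None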
  12. MStore Coordination Tier
     • Manages the cluster map (sketched below).
     • Serves queries and pushes changes to the map caches in the service nodes and in the proxies.
     • Functionally similar to the MongoDB Config server or ZooKeeper.
     • Backed by a transactional, highly available repository.
     [Diagram: coordinator instances backed by a Coordinator DB; MStore service instances and proxies GET the map on init and receive pushed updates on change]
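
     A sketch of the cluster map the coordinator serves. The exact schema is
     an assumption; what matters is that it carries the three mappings the
     service tier needs for routing (see the route sketch above):

        cluster_map = {
            "version": 17,   # bumped on every change, so caches can detect staleness
            "bucket_to_rs": {"tenantA.b42": "RS2"},
            "rs_members": {
                "RS2": {"primary": "mongod-2a",
                        "secondaries": ["mongod-2b", "mongod-2c"]},
            },
            "mongod_to_host": {"mongod-2a": "host07",
                               "mongod-2b": "host11",
                               "mongod-2c": "host23"},
        }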
  13. The big picture
     [Diagram: app instances, the mstore service tier, the coordination tier with its DB, the storage servers, and the workflow management and automation tools]
  14. Bucket Compaction
     • Document deletes are expensive, so we prefer marking documents with tombstones; hence the need to compact.
     • Compaction is done on an AUX storage node so as not to disturb ongoing operation.
     • When the new bucket image is ready, in the proxy:
       • hold new writes
       • flush pending writes
       • "flip the switch": BucketName -> DB
       • resume writes
     • This is not easy, and is implemented as a multi-step workflow (see the sketch below).
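
     A sketch of that cut-over sequence; the proxy methods are hypothetical
     (flip is the repointing shown in the translator sketch on slide 9):

        def flip_the_switch(proxy, bucket, new_db):
            proxy.pause_writes(bucket)    # hold new writes in the proxy
            proxy.flush_pending(bucket)   # flush writes already in flight
            proxy.flip(bucket, new_db)    # BucketName -> DB now points at the new image
            proxy.resume_writes(bucket)   # resume writes against the new db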
  15. Other workflows
     • The compaction workflow is quite hard to master.
     • The good news is that the other workflows are very similar (see the sketch below):
       • a bucket move is roughly the same, except that the target RS is different
       • a bucket split creates 2 new buckets
       • a bucket merge is like 2 moves
       • etc.
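
     A sketch of that reuse; compact here is a stand-in for the full workflow
     on slide 20, and all names and parameters are hypothetical:

        def compact(bucket, dest_rs=None, dest_buckets=1):
            # Stand-in for the multi-step compaction workflow (slide 20).
            ...

        def move_bucket(bucket, target_rs):
            # Same copy/replay/flip sequence as compaction, but the new
            # image is built on, and flipped to, a different replica set.
            return compact(bucket, dest_rs=target_rs)

        def split_bucket(bucket):
            # One scan of the source, routing each surviving document
            # into one of two new buckets instead of a single destination.
            return compact(bucket, dest_buckets=2)

        def merge_buckets(a, b, target_rs):
            # Effectively two moves landing on the same target replica set.
            move_bucket(a, target_rs)
            move_bucket(b, target_rs)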
  16. What are we achieving?
     • Elastic expansion of the storage tier
     • Full control over what, when, and how fast to rebalance
     • Efficient rebalancing
     • Smooth and predictable operation
     • Intelligent DB connection management
     • SLA measurement
  17. Final words
     • Open source? Looking for feedback; contact us if interested.
     • Thank you!
  18. Appendix
  19. Buckets
     • We need a smaller unit of data granularity: the bucket.
       • sizeof(bucket) << sizeof(data set)
     • A bucket is a single MongoDB DB; one Replica Set holds many buckets.
     • Bucket operations:
       • create / delete
       • split / merge
       • compact
       • move (to another RS)
     • Multiple MongoD processes from different replica sets run on the same physical host for best storage utilization (on big bare metal); these could be LXC containers.
     [Diagram: one storage host running four MongoD processes (RS1 through RS4), each holding several buckets]
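
     Because a bucket is just a MongoDB database, creating one is an ordinary
     (lazy) database creation; a sketch with pymongo, where host, port, and
     naming scheme are all illustrative:

        from pymongo import MongoClient

        client = MongoClient("mongodb://host07:27018/?replicaSet=RS2")
        bucket = client["tenantA_b42_g0"]   # generation 0 of bucket b42
        bucket["media"].insert_one({"_id": "271828182845:img:0", "deleted": False})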
  20. Bucket Compaction
     • Document deletes are expensive; we prefer marking documents with tombstones.
     • Compaction is the process of generating a new image of the bucket after purging deleted and expired documents.
     • Compaction is done on an AUX storage node so as not to disturb ongoing operation.
     • When the new bucket image is ready, just "flip the switch" in the proxy.
     • This is not easy, and is implemented as a multi-step workflow (sketched after the steps below):
     Compaction Workflow
       1. Mark the oplog time; take a snapshot.
       2. Copy the snapshot to the AUX node.
       3. Start 2 stand-alone mongods: source and destination.
       4. Bulk-scan the source, skip deleted or expired docs, and insert the rest into the destination db.
       5. Transfer the compacted bucket image to all nodes in the original replica set.
       6. Replay the oplog from the old db to the new db.
     "Flip the switch" phase:
       7. Pause writes to the old db in the proxy.
       8. Keep replaying the oplog until all queues are drained.
       9. Tell the proxy to enable writes to the new DB.
       10. Update the Coordinator map.
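
     The same ten steps as a code sketch; every helper object and method here
     is hypothetical, but the step order follows the slide:

        def expired(doc):
            # Stand-in TTL check; the real expiry policy is not shown.
            return False

        def compaction_workflow(bucket, rs, aux, proxy, coordinator):
            ts = rs.primary.oplog_time()                 # 1. mark oplog time...
            snapshot = rs.primary.snapshot(bucket)       #    ...and take a snapshot
            aux.receive(snapshot)                        # 2. copy snapshot to the AUX node
            src = aux.start_mongod(snapshot)             # 3. stand-alone source...
            dst = aux.start_mongod()                     #    ...and destination
            for doc in src.scan(bucket):                 # 4. bulk-scan the source
                if not doc.get("deleted") and not expired(doc):
                    dst.insert(doc)                      #    keep only live docs
            rs.install_image(dst.image())                # 5. ship the image to the original RS
            rs.replay_oplog(bucket, since=ts, into=dst.db_name)   # 6. catch up on writes
            proxy.pause_writes(bucket)                   # 7. hold new writes
            rs.replay_oplog_until_drained(bucket)        # 8. drain remaining queues
            proxy.flip(bucket, dst.db_name)              # 9. enable writes to the new DB
            coordinator.update_map(bucket, dst.db_name)  # 10. publish the new mapping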
