DB Scalability @ eBay eBay is one of the first and largest BASE environments based on Oracle DB App1 App2 • Basic Availability • Soft-state Business Business • Eventual consistency Logic Logic Every database we use is shared and partitioned Hint (shard key) Hint (shard key) • N logical hosts names are defined for each use case ahead of time DAL DAL • These logical hosts are mapped to physical based on static mapping tables which are controlled by DBAs Framework Framework • A common ORM framework called DAL provides powerful and consistent patterns for data scalability Applications If the client provides a hint along with every DB F1(Hint) F2(Hint) query: • DAL maps the hint to a logical host using one of N mapping Logical DB schemes (ex: modulus, lookup table, range, etc) … hosts … • Logical host is then mapped to a physical using L-to-Ph map (shards) • The query is sent to just one shard If the client does not have a hint, the query is sent to Config all shards and the results are joined on the client Physical with the help of DAL framework … Master DB Side-effects: hosts • Hint is not part of the query; client has to manage it Physical • Logical to Physical mapping scheme becomes extra piece of … standby client configuration • Shard rebalancing is “DBA magic” DB hosts
Key desired improvements All eBay site-facing applications use the scheme outlined above It’s proven to scale to tens of thousands of developers, petabytes of data, hundreds of millions of SQL queries per day But there is always room for improvements and new ideas • ORM is not the fastest way to develop; how do we achieve faster development cycles and reduce schema mapping frictions? • How do we add new attributes to tables faster and without DBA’s involvement? Schema free approach sounds interesting. • Can we make the hint transparent, ex: auto-extract it from queries? • Can we rebalance the data seamlessly and automatically? • Can we add shards faster in order to scale out on demand and transparently to applications? • How do we deploy new DBs to the cloud on demand? And what about performance? Can we use RAM more aggressively and seamlessly to speed up queries?
Enters MongoDB We are playing with MongoDB since 2010. Why? Business Logic Document Its scalability scheme is very similar to how we shard RDBMS Morphia/Mongo • Single master for writes, eventually consistent slaves for Driver reads Dynamic • Horizontal partitioning of data sets is a norm at eBay Config • MongoS is performing familiar scatter-gather and client- side merge-sorts MongoS F(Shard Key) We don’t use distributed transactions since day 1; transactional updates of multiple tables … that we do use can be simulated by atomic <- Replicas -> updates of a single Mongo document MongoDB offers a number of features that … help address our goals mentioned earlier: • Developers love document model and schema-free persistence … • Hints are embedded into the queries • MongoDB has automatic shard rebalancing • Shards can be added on demand without application restart and data will be auto-rebalanced ---------- Shards ------- • We can easily bring it up in the cloud since cloud machines have storage
Case study #1: eBay Search Suggestions Search suggestion list is a MongoDB document indexed by word prefix as well as by some metadata: product category, search domain, etc. Must have < 60-70msec round trip end to end MongoDB query < 1.4msec Data set fits in RAM; 100-s M documents Data is bulk loaded once a day from Hadoop, but can be tweaked on demand during sale promotions, etc Single replica set, no shards in this case MongoDB benefits: • Multiple indexes allow flexible lookups • In-memory data placement ensures lookup speed • Large data set is durable and replicated
Case Study #2: Cloud Manager “State Hub” Query State Hub powers eBay CloudProvision ResourcesResources Every resource provisioned by the cloud is and Topology represented by a single Mongo document Documents contain highly structured metadata reflecting roles and grouping of the resources Lookup by both primary and secondary State Hub indexes Mongo Several GB data sets, easily fit in RAM Update Documents are not uniform resource state All resources have “State” field which is updated periodically to reflect health state of the underlying resource Mixed workload: lots of in-place writes, but also lots of read queries
Case Study #3: eBay Merchandizing Info Cache Merchandizing backend powers eBay product/item classification and categorization Each MongoDB document represents a cluster of similar products Numerous relationships between clusters are modeled as R1 document attributes Cluster1 Cluster2 Relationship hierarchy traversal is achieved by issuing a R3 number of queries on “edge” attributes R2 Each instance of such a hierarchy is called a model; there Cluster3 are lots of models Again, data set fits in RAM, single replica set Replica set members are located in 3 different data centers (3+2+2) with all members in a single data center having higher weight to avoid moving master away MongoDB benefits: • Schema-free design and declarative indexes are perfect for this use case where new attributes and new queries are constantly being added • Async replication across multiple data centers • MongoDB Java Driver ensures automatic detection of proximity of clients to replica set members; reads with slaveOK=true are served from local data center nodes which insures low response latency
Case Study #4: Zoom – Media Metadata Store This is a new mega project which is a work in progress MongoDB is being evaluated as a storage backend for all media-related metadata on the site (example: picture IDs with lots attributes) Requirements: • Tens of TBs data set, Millions of documents: data set must be partitioned; this is our first use case where MongoDB sharding is used • System of record for picture info; data can not be lost! • Replication/DR across 2 data centers; local DC reads are required • Queries are from site-facing flows; <10msec response time SLA • Mixed workload: both inserts and reads are happening concurrently all the time Can MongoDB do it ??
Zoom: Data Model 2 main collections: Item and Image • Item references multiple Images Item represents eBay Item: • _id in Item is external ID of the item in eBay site DB • These IDs are already sharded in balanced across N logical DB hosts using ID ranges • We use MongoDB pre-split points for initial mapping our N site DB shards to M MongoDB shards • This ensures good balance between the shards; Image represents a picture attached to an Item • _id in Image is md5 of the image content • This ensures good distribution across any number of shards • Md5 is also used to find duplicate images Our choice of document IDs in both collections ensures good balance across Mongo shards We never query both collections in a single service request to ensure data consistency and to have only one index lookup
Zoom: Service Topology and Configuration MongoS is deployed on app servers • Ensures network IO on MongoS won’t become a bottleneck • This is a very familiar pattern in eBay as was explained in the>--- DC1(Primary)--- beginning of this presentation M shards; each replica set has 6 members M M M M • 3 + 3 in 2 data centers • Master can be only in one DC during automatic failover; manual failover may activate another DC --- Replicas ---> • One slave in the secondary DC is invisible for reads and is dedicated to periodic backups/snapshots (more on this later) For reads, client first sets SlaveOK=true and if required document is not found flips to SlaveOK=false to read from Master -- DC2(Secondary)--> Home-grown MongoDB configuration and monitoring agent is running on every node • Fetches MongoD configuration from a central configuration store and saves it to local config file • Manages lifecycle of MongoD B B B B • Monitors state and metrics ---- Shards -----
Zoom: Data Backup and Restore strategy Goals: • Take periodic backups of the entire data set Application • Be able to recover from backup • Do not loose any writes that have happened after last snapshot • Briefly service unavailability during recovery is better than data Dual-write loss … to capped M collection C Dual writes on the client • Regular write to main cluster … • Second write to another Mongo cluster: single replica set, capped collection, the data written is similar to REDO log record Recovery B Agent Hidden slave in each shard has volume mounted on a remote storage appliance capable of instant file system snapshot; captures both DB files and journal files If DB recovery is activated: • All MongoD on primary cluster are shutdown • NFS slave is remounted to snapshot volume Instant • MongoD on this machine is started as a masterShapshot • MongoD on other replica set members are started cold • Full sync-up from masterCapable • Master is switched to a regular member device • Writes that occurred since time when the backup was taken are replayed from the REDO log capped collection in the secondary cluster •
Key Learning MongoDB can be a very powerful tool but use it wisely Deletes can be slow; automatic balancer is dangerous; use it only when you must (example: be careful when adding new shards) Use explain for every query; disable full scans to discover inefficiencies early Query profiler is great Retry every failed query at least once; long tail in response times is possible when data set > RAM size