Yuri Finkelstein
eBay Platform Architect
yfinkelstein@ebay.com

May 2012
DB Scalability @ eBay
 eBay is one of the first and largest BASE
  environments based on Oracle DB
    •   Basically Available
    •   Soft state
    •   Eventual consistency
 Every database we use is shared and partitioned
    •   N logical host names are defined for each use case ahead
        of time
    •   These logical hosts are mapped to physical hosts by static
        mapping tables which are controlled by DBAs
    •   A common ORM framework called DAL provides powerful
        and consistent patterns for data scalability
 If the client provides a hint along with every DB
  query:
    •   DAL maps the hint to a logical host using one of N mapping
        schemes (e.g. modulus, lookup table, range)
    •   The logical host is then mapped to a physical host using the L-to-Ph map
    •   The query is sent to just one shard
 If the client does not have a hint, the query is sent to
  all shards and the results are joined on the client
  with the help of the DAL framework
 Side effects:
    •   The hint is not part of the query; the client has to manage it
    •   The logical-to-physical mapping scheme becomes an extra piece of
        client configuration
    •   Shard rebalancing is “DBA magic”

[Diagram: applications App1 and App2 (business logic) each pass a hint (shard key) to the DAL framework; F1(Hint) and F2(Hint) select logical DB hosts (shards), which config maps to physical master DB hosts and physical standby DB hosts]
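The two-step routing described on this slide can be sketched roughly as follows. This is a minimal illustration, not DAL itself: the host names, the table contents, and the function names `modulus_scheme` and `route` are all hypothetical, and only the modulus scheme is shown.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical static tables; per the slide, DBAs control the real ones.
NUM_LOGICAL_HOSTS = 4
LOGICAL_TO_PHYSICAL: Dict[str, str] = {
    "users_0": "db-host-a",
    "users_1": "db-host-a",
    "users_2": "db-host-b",
    "users_3": "db-host-b",
}


def modulus_scheme(hint: int) -> str:
    """Map a hint (shard key) to a logical host name using modulus."""
    return f"users_{hint % NUM_LOGICAL_HOSTS}"


def route(hint: Optional[int],
          scheme: Callable[[int], str] = modulus_scheme) -> List[str]:
    """Return the physical hosts a query has to be sent to."""
    if hint is None:
        # No hint: scatter to every shard; the caller merges the results.
        return sorted(set(LOGICAL_TO_PHYSICAL.values()))
    # Hint present: logical host first, then the L-to-Ph lookup.
    return [LOGICAL_TO_PHYSICAL[scheme(hint)]]
```

With these assumed tables, `route(6)` targets a single physical host, while `route(None)` returns every physical host so the client can gather and join the results.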
Key desired improvements

 All eBay site-facing applications use the scheme outlined above
 It’s proven to scale to tens of thousands of developers, petabytes of data, hundreds
  of millions of SQL queries per day
 But there is always room for improvement and new ideas
   • ORM is not the fastest way to develop; how do we achieve faster development cycles and reduce
     schema mapping friction?
   • How do we add new attributes to tables faster and without DBA involvement? A schema-free approach
     sounds interesting.
   • Can we make the hint transparent, e.g. auto-extract it from queries?
   • Can we rebalance the data seamlessly and automatically?
   • Can we add shards faster in order to scale out on demand and transparently to applications?
   • How do we deploy new DBs to the cloud on demand?


 And what about performance? Can we use RAM more aggressively and
  seamlessly to speed up queries?
Enter MongoDB

 We have been playing with MongoDB since 2010.
  Why?
 Its scalability scheme is very similar to how
  we shard RDBMS
   •   Single master for writes, eventually consistent slaves for
       reads
   •   Horizontal partitioning of data sets is the norm at eBay
   •   MongoS performs the familiar scatter-gather and client-side
       merge-sorts
 We have not used distributed transactions since
  day 1; the transactional updates of multiple tables
  that we do use can be simulated by atomic
  updates of a single Mongo document
 MongoDB offers a number of features that
  help address our goals mentioned earlier:
   •   Developers love the document model and schema-free
       persistence
   •   Hints are embedded into the queries
   •   MongoDB has automatic shard rebalancing
   •   Shards can be added on demand without application
       restart, and data will be auto-rebalanced
   •   We can easily bring it up in the cloud since cloud
       machines have storage

[Diagram: business logic passes documents to the Morphia/Mongo driver and on to MongoS, which uses dynamic config and F(Shard Key) to route to the shards; each shard has replicas]
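As an illustration of simulating a multi-table transaction with one atomic document update, the sketch below builds a filter and an update document that change several fields of a single document in one operation. The bidding scenario, the field names, and `build_bid_update` are hypothetical; `$set`, `$inc`, `$push`, and `$lt` are standard MongoDB operators.

```python
from typing import Any, Dict, Tuple


def build_bid_update(item_id: int, bidder: str,
                     amount: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build a (filter, update) pair that touches one document only."""
    # The filter doubles as a guard: it matches only if the new bid is higher.
    query = {"_id": item_id, "high_bid": {"$lt": amount}}
    # All three modifications are applied to the matched document together.
    update = {
        "$set": {"high_bid": amount, "high_bidder": bidder},
        "$inc": {"bid_count": 1},
        "$push": {"history": {"bidder": bidder, "amount": amount}},
    }
    return query, update
```

With PyMongo, the pair would typically be passed to `collection.update_one(query, update)`. In a relational schema the same change might span an items table and a bid-history table.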
Case study #1: eBay Search Suggestions


                      Search suggestion list is a MongoDB document
                       indexed by word prefix as well as by some
                       metadata: product category, search domain,
                       etc.
                      End-to-end round trip must stay under 60-70 ms
                      MongoDB query takes < 1.4 ms
                      Data set fits in RAM; hundreds of millions of documents
                      Data is bulk loaded once a day from Hadoop,
                       but can be tweaked on demand during sale
                       promotions, etc
                      Single replica set, no shards in this case
                      MongoDB benefits:
                        •   Multiple indexes allow flexible lookups
                        •   In-memory data placement ensures lookup speed
                        •   Large data set is durable and replicated
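A lookup by word prefix plus metadata might look roughly like the sketch below. The documents, the field names, and the fall-back to the longest indexed prefix are assumptions made for illustration; a plain dictionary stands in for a compound index on (prefix, category).

```python
from typing import Dict, List, Tuple

# Hypothetical documents: one suggestion list per (prefix, category) pair.
SUGGESTIONS = [
    {"prefix": "ip", "category": "electronics",
     "suggestions": ["iphone", "ipad", "ipod"]},
    {"prefix": "iph", "category": "electronics",
     "suggestions": ["iphone", "iphone case"]},
    {"prefix": "ip", "category": "books",
     "suggestions": ["ipad for dummies"]},
]

# Stand-in for a compound index on (prefix, category).
INDEX: Dict[Tuple[str, str], List[str]] = {
    (doc["prefix"], doc["category"]): doc["suggestions"] for doc in SUGGESTIONS
}


def suggest(typed: str, category: str) -> List[str]:
    """Return the list stored for the longest indexed prefix of `typed`."""
    for end in range(len(typed), 0, -1):
        hit = INDEX.get((typed[:end], category))
        if hit is not None:
            return hit
    return []
```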
Case Study #2: Cloud Manager “State Hub”

 State Hub powers eBay Cloud
 Every resource provisioned by the cloud is
  represented by a single Mongo document
 Documents contain highly structured
  metadata reflecting roles and grouping of
  the resources
 Lookup by both primary and secondary
  indexes
 Several GB data sets, easily fit in RAM
 Documents are not uniform
 All resources have a “State” field which is
  updated periodically to reflect the health state
  of the underlying resource
 Mixed workload: lots of in-place writes, but
  also lots of read queries

[Diagram: clients provision resources, query resources and topology, and update resource state through State Hub, which is backed by Mongo]
Case Study #3: eBay Merchandizing Info Cache

 Merchandizing backend powers eBay product/item
  classification and categorization
 Each MongoDB document represents a cluster of similar
  products
 Numerous relationships between clusters are modeled as
  document attributes
 Relationship hierarchy traversal is achieved by issuing a
  number of queries on “edge” attributes
 Each instance of such a hierarchy is called a model; there
  are lots of models
 Again, data set fits in RAM, single replica set
 Replica set members are located in 3 different data
  centers (3+2+2); all members in one data center
  have a higher weight so that the master does not move away
 MongoDB benefits:
    •   Schema-free design and declarative indexes are perfect for this use
        case, where new attributes and new queries are constantly being
        added
    •   Async replication across multiple data centers
    •   MongoDB Java Driver automatically detects the proximity
        of clients to replica set members; reads with slaveOK=true are
        served from local data center nodes, which ensures low
        response latency

[Diagram: Cluster1, Cluster2, and Cluster3 connected by relationships R1, R2, and R3]
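Traversing a relationship hierarchy by issuing one query per step can be sketched as below. The document layout, the `edges` attribute name, and the direction of R1, R2, and R3 are assumptions; `find_by_id` stands in for an indexed MongoDB query.

```python
from collections import deque
from typing import Dict, List

# Hypothetical cluster documents; "edges" holds the relationship attributes.
CLUSTERS: Dict[str, dict] = {
    "Cluster1": {"_id": "Cluster1",
                 "edges": {"R1": "Cluster2", "R3": "Cluster3"}},
    "Cluster2": {"_id": "Cluster2", "edges": {"R2": "Cluster3"}},
    "Cluster3": {"_id": "Cluster3", "edges": {}},
}


def find_by_id(cluster_id: str) -> dict:
    """Stand-in for one indexed query against the collection."""
    return CLUSTERS[cluster_id]


def traverse(root_id: str) -> List[str]:
    """Breadth-first walk that issues one lookup per visited cluster."""
    seen = {root_id}
    order: List[str] = []
    queue = deque([root_id])
    while queue:
        doc = find_by_id(queue.popleft())
        order.append(doc["_id"])
        for target in doc["edges"].values():
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Because the data set fits in RAM, each step is cheap, which is what makes a multi-query traversal like this practical.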
Case Study #4: Zoom – Media Metadata Store

 This is a new mega project which is a work in progress
 MongoDB is being evaluated as a storage backend for all media-related
  metadata on the site (example: picture IDs with lots of attributes)
 Requirements:
   • Tens of TBs of data, millions of documents: data set must be partitioned; this is our
     first use case where MongoDB sharding is used
   • System of record for picture info; data cannot be lost!
   • Replication/DR across 2 data centers; local DC reads are required
   • Queries are from site-facing flows; <10 ms response time SLA
   • Mixed workload: both inserts and reads are happening concurrently all the time


 Can MongoDB do it?
Zoom: Data Model

                    2 main collections: Item and Image
                      •   Item references multiple Images

                    Item represents eBay Item:
                      •   _id in Item is the external ID of the item in eBay site DB
                      •   These IDs are already sharded in a balanced way across N
                          logical DB hosts using ID ranges
                      •   We use MongoDB pre-split points for the initial
                          mapping of our N site DB shards to M MongoDB shards
                      •   This ensures good balance between the shards

                    Image represents a picture attached to an
                     Item
                      •   _id in Image is the MD5 hash of the image content
                      •   This ensures good distribution across any number of
                          shards
                      •   MD5 is also used to find duplicate images

                    Our choice of document IDs in both
                     collections ensures good balance across
                     Mongo shards
                    We never query both collections in a single
                     service request to ensure data consistency
                     and to have only one index lookup
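The effect of using a content hash as the Image `_id` can be illustrated with the standard `hashlib` module. The `shard_for` placement rule below is a hypothetical stand-in that splits the 128-bit hash space into equal ranges; it is not how MongoDB assigns chunks.

```python
import hashlib
from collections import Counter


def image_id(content: bytes) -> str:
    """Derive the Image _id from the picture bytes."""
    return hashlib.md5(content).hexdigest()


def shard_for(doc_id: str, num_shards: int) -> int:
    """Hypothetical rule: equal ranges over the 128-bit hash space."""
    return (int(doc_id, 16) * num_shards) >> 128


def distribution(num_images: int, num_shards: int) -> Counter:
    """Count how many synthetic images land on each shard."""
    return Counter(
        shard_for(image_id(f"image-{i}".encode()), num_shards)
        for i in range(num_images)
    )
```

Identical image bytes always produce the same `_id`, which is what makes duplicate detection a single lookup, and the hash spreads unrelated images close to evenly across the ranges.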
Zoom: Service Topology and Configuration

 MongoS is deployed on app servers
    •   Ensures network IO on MongoS won’t become a bottleneck
    •   This is a very familiar pattern at eBay, as was explained at the
        beginning of this presentation
 M shards; each replica set has 6 members
    •   3 + 3 in 2 data centers
    •   Master can be in only one DC during automatic failover; manual
        failover may activate another DC
    •   One slave in the secondary DC is invisible for reads and is
        dedicated to periodic backups/snapshots (more on this later)
 For reads, the client first sets SlaveOK=true and, if the
  required document is not found, flips to
  SlaveOK=false to read from the master
 A home-grown MongoDB configuration and monitoring
  agent runs on every node
    •   Fetches MongoD configuration from a central configuration store
        and saves it to a local config file
    •   Manages the lifecycle of MongoD
    •   Monitors state and metrics

[Diagram: shards laid out side by side, each a replica set replicated from DC1 (primary) to DC2 (secondary); members are labeled M and B]
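The read path described on this slide (try a slave first, then flip to the master when the document is missing) can be sketched as follows. The two reader callables are hypothetical stand-ins for a query issued with SlaveOK=true and one issued with SlaveOK=false.

```python
from typing import Callable, Optional

Reader = Callable[[str], Optional[dict]]


def read_with_fallback(doc_id: str, read_slave: Reader,
                       read_master: Reader) -> Optional[dict]:
    """Try a local slave first; ask the master if the document is missing."""
    doc = read_slave(doc_id)
    if doc is None:
        # The slave may simply be lagging, so check the master before giving up.
        doc = read_master(doc_id)
    return doc
```

The fallback only costs a second round trip for documents the slave has not yet received, so most reads stay in the local data center.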
Zoom: Data Backup and Restore strategy

 Goals:
    •   Take periodic backups of the entire data set
    •   Be able to recover from backup
    •   Do not lose any writes that have happened after the last snapshot
    •   Brief service unavailability during recovery is better than data
        loss
 Dual writes on the client
    •   Regular write to main cluster
    •   Second write to another Mongo cluster: single replica set,
        capped collection; the data written is similar to a REDO log record
 Hidden slave in each shard has a volume mounted on a
  remote storage appliance capable of instant file
  system snapshots; a snapshot captures both DB files and journal
  files
 If DB recovery is activated:
    •   All MongoD processes on the primary cluster are shut down
    •   The NFS slave is remounted to the snapshot volume
    •   MongoD on this machine is started as a master
    •   MongoD on the other replica set members is started cold
    •   Full sync-up from master
    •   Master is switched to a regular member
    •   Writes that occurred since the time when the backup was taken
        are replayed from the REDO log capped collection in the
        secondary cluster

[Diagram: the application writes to the main cluster (M) and dual-writes to a capped collection (C); the backup member (B) sits on an instant-snapshot-capable device; a recovery agent drives the restore]
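The dual-write and replay idea can be sketched in memory as below. This is an illustration under stated assumptions, not the production design: a `deque` with `maxlen` stands in for the capped collection, a logical counter stands in for timestamps, and the names `DualWriter` and `recover` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple

RedoRecord = Tuple[int, str, dict]


class DualWriter:
    """Writes each document to the main store and to a REDO-style log."""

    def __init__(self, redo_capacity: int) -> None:
        self.main: Dict[str, dict] = {}
        # deque(maxlen=...) drops the oldest entry when full,
        # similar to a capped collection.
        self.redo: Deque[RedoRecord] = deque(maxlen=redo_capacity)
        self.clock = 0  # logical timestamp, an assumption of this sketch

    def write(self, doc_id: str, doc: dict) -> int:
        self.clock += 1
        self.main[doc_id] = doc                            # regular write
        self.redo.append((self.clock, doc_id, dict(doc)))  # second write
        return self.clock


def recover(snapshot: Dict[str, dict], snapshot_time: int,
            redo: Iterable[RedoRecord]) -> Dict[str, dict]:
    """Restore the snapshot, then replay the writes made after it."""
    restored = dict(snapshot)
    for ts, doc_id, doc in redo:
        if ts > snapshot_time:
            restored[doc_id] = doc
    return restored
```

The log has to be large enough to hold every write between two snapshots; otherwise the oldest records are dropped before they can be replayed.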
Key Learning


 MongoDB can be a very powerful tool, but use it wisely
 Deletes can be slow; the automatic balancer is dangerous, so use it only when you
  must (example: be careful when adding new shards)
 Use explain for every query; disable full scans to discover inefficiencies
  early
 The query profiler is great
 Retry every failed query at least once; a long tail in response times is possible
  when the data set is larger than RAM
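The advice to retry every failed query at least once can be sketched as a small wrapper. The name `with_retry` and the choice to retry on any exception are assumptions; real code would likely retry only on timeouts and connection errors.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(query: Callable[[], T], retries: int = 1) -> T:
    """Run a query, retrying up to `retries` extra times if it raises."""
    for attempt in range(retries + 1):
        try:
            return query()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == retries:
                raise
    raise AssertionError("unreachable when retries >= 0")
```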
Questions?


 Thank you!

MongoDB at eBay
