MongoDB: Scaling write performance
Junegunn Choi
MongoDB

• Document data store
• JSON-like document
• Secondary indexes
• Automatic failover
• Automatic sharding
First impression: Easy

• Easy installation
• Easy data model
• No prior schema design
• Native support for secondary indexes
Second thought: Not so easy

• No SQL
• Coping with massive data growth
• Setting up and operating a sharded cluster
• Scaling write performance
Today we’ll talk about insert performance
Insert throughput on a replica set

Steady 5k inserts/sec

* 1 kB record, ObjectId as PK
* WriteConcern: journal sync on majority
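As a rough sketch of the insert path behind these numbers (database, collection, and payload field names are assumptions; the write concern mirrors the footnote above):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hypothetical replica-set connection string; adjust hosts/replicaSet as needed.
client = MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0")

# Journaled write, acknowledged by a majority of the set,
# i.e. the WriteConcern used for the benchmark above.
records = client.test.records.with_options(
    write_concern=WriteConcern(w="majority", j=True)
)

# ~1 kB document; the default _id is a monotonically increasing ObjectId.
records.insert_one({"payload": "x" * 1000})
```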
Insert throughput on a replica set with a secondary index
Culprit: B+Tree index

• Good at sequential inserts
  • e.g. ObjectId, sequence #, timestamp
• Poor at random inserts
  • Indexes on randomly-distributed data
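To make the two cases concrete, here is a minimal sketch (collection and field names are assumptions): the automatic index on _id receives ever-increasing ObjectIds, while a secondary index on a random hex token receives uniformly distributed keys and touches pages across the whole B+Tree.

```python
import os
from pymongo import MongoClient, ASCENDING

records = MongoClient().test.records  # hypothetical collection

# Secondary index on a randomly distributed field.
records.create_index([("token", ASCENDING)])

for _ in range(10000):
    records.insert_one({
        "payload": "x" * 1000,
        # The default _id (ObjectId) is monotonically increasing:
        # inserts append to the right edge of that index.
        # "token" is uniformly random: inserts land on random pages
        # of the secondary index, inflating the working set.
        "token": os.urandom(8).hex(),
    })
```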
Sequential vs. random insert

[Diagram: two B+Trees. Sequential keys (1, 2, 3, ...) are appended to the rightmost leaf; random keys (55, 75, 78, ...) land on leaves all over the tree.]

Sequential insert ➔ small working set ➔ fits in RAM ➔ sequential I/O (bandwidth-bound)
Random insert ➔ large working set ➔ cannot fit in RAM ➔ random I/O (IOPS-bound)
So, what do we do now?
1. Partitioning

[Diagram: a single B+Tree over all data does not fit in memory; partitioned by month (Aug 2012, Sep 2012, Oct 2012), the current partition's B+Tree fits in memory.]
1. Partitioning

• MongoDB doesn’t support partitioning
• Partitioning at the application level
• e.g. daily log collections (sketch below)
  • logs_20121012
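A minimal sketch of this kind of application-level partitioning (database name and document shape are assumptions; the collection-name format follows the slide):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient().logdb  # hypothetical database

def log_collection(now=None):
    """Return the collection for the current day, e.g. logs_20121012."""
    now = now or datetime.now(timezone.utc)
    return db["logs_" + now.strftime("%Y%m%d")]

# Writes always go to today's (small) collection; old collections can be
# dropped or archived wholesale instead of deleted document by document.
log_collection().insert_one({"level": "INFO", "msg": "hello"})
```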
Switch collection every hour
2. Better H/W

• More RAM
• More IOPS
  • RAID striping
  • SSD
  • AWS Provisioned IOPS (1k ~ 10k)
3. More H/W: Sharding

• Automatic partitioning across nodes

[Diagram: SHARD1, SHARD2, SHARD3 behind a mongos router]
3 shards (3x3)
3 shards (3x3) on RAID 1+0
There’s no free lunch

• Manual partitioning
  • Incidental complexity
• Better H/W
  • $
• Sharding
  • $$
  • Operational complexity
“Do you really need that index?”
Scaling insert performance with sharding
= Choosing the right shard key
Shard key example: year_of_birth

[Diagram: 64 MB chunks of the USERS collection, partitioned by year_of_birth ranges (~1950, 1951~1970, 1971~1990, 1991~2005, 2006~2010, 2010~∞), spread across SHARD1, SHARD2, and SHARD3 behind a mongos router.]
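For reference, sharding a collection on such a key would look roughly like this (database, collection, and host names are assumptions):

```python
from pymongo import MongoClient

# Connect through a mongos router (hypothetical host name).
client = MongoClient("mongodb://mongos.example.com")

client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.users",
    key={"year_of_birth": 1},  # range-partitioned on year_of_birth, as in the diagram
)
```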
5k inserts/sec w/o sharding
Sequential key

• ObjectId as shard key
• Sequence #
• Timestamp
Worse throughput with 3x H/W.
Sequential key

• All inserts go into one chunk
• Cannot scale insert performance
• Chunk migration overhead

[Diagram: chunks 1000~2000, 5000~7500, 9000~∞ of USERS on SHARD-x; new keys 9001, 9002, 9003, 9004, ... all land in the last chunk.]
[Chart: insert throughput over time with a sequential shard key]
Hash key

• e.g. SHA1(_id) = 9f2feb0f1ef425b292f2f94 ...
• Distributes inserts evenly across all chunks
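The slide's approach computes the hash in the application and shards on that stored field; a minimal sketch (the field name shard_hash and the connection details are assumptions):

```python
import hashlib
from bson import ObjectId
from pymongo import MongoClient

users = MongoClient("mongodb://mongos.example.com").mydb.users  # hypothetical

doc_id = ObjectId()
users.insert_one({
    "_id": doc_id,
    # SHA1 of the sequential _id spreads consecutive inserts evenly
    # across the shard key space, so every chunk receives writes.
    "shard_hash": hashlib.sha1(str(doc_id).encode()).hexdigest(),
    "payload": "x" * 1000,
})
```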
Hash key

• Performance drops as the collection grows
  • Why? Mandatory index on the shard key
    • B+Tree problem again!
[Chart: insert throughput, sequential key vs. hash key]
Sequential + hash key

• Coarse-grained sequential prefix
• e.g. year-month + hash value (sketch below)
  • 201210_24c3a5b9

[Diagram: B+Tree where current inserts stay within the 201210_* key range; the older 201208_* and 201209_* ranges are no longer written.]
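A sketch of building such a key (function and field names are assumptions; the format follows the slide's 201210_24c3a5b9 example):

```python
import hashlib
from datetime import datetime, timezone
from bson import ObjectId

def month_hash_key(doc_id, now=None):
    """Coarse sequential prefix (year-month) + 8-char hash suffix,
    e.g. '201210_24c3a5b9'."""
    now = now or datetime.now(timezone.utc)
    suffix = hashlib.sha1(str(doc_id).encode()).hexdigest()[:8]
    return now.strftime("%Y%m") + "_" + suffix

print(month_hash_key(ObjectId()))  # only the current month's key range is hot
```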
But what if...

[Diagram: B+Tree where the current 201210_* range alone has a large working set; the hot part of the index no longer fits in RAM.]
Sequential + hash key

• Can you predict the data growth rate?
• Balancer is not clever enough
  • Only considers the # of chunks
  • Migration is slow during heavy writes
[Chart: insert throughput, sequential key vs. hash key vs. sequential + hash key]
Low-cardinality hash key

• Small portion of the hash value
  • e.g. A~Z, 00~FF
• Alleviates the B+Tree problem
  • Sequential access on a fixed # of parts
    • Cardinality / # of shards

[Diagram: shard key range A ~ D; each shard’s local B+Tree is appended to at only a few points (A A A, B B B, C C C).]
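A sketch of deriving such a key, keeping only the first byte of the hash (256 possible values, matching the 00~FF example; names are assumptions):

```python
import hashlib
from bson import ObjectId

def low_cardinality_key(doc_id):
    """First byte of SHA1(_id), e.g. '9f': only 256 distinct shard key
    values, so each shard appends at a handful of points in its index."""
    return hashlib.sha1(str(doc_id).encode()).hexdigest()[:2]
```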
Low-cardinality hash key

• Limits the # of possible chunks
  • e.g. 00 ~ FF ➔ 256 chunks
  • Chunks grow past 64 MB
    • Balancing becomes difficult
[Chart: insert throughput, sequential key vs. hash key vs. sequential + hash key vs. low-cardinality hash key]
Low-cardinality hash prefix + sequential part

• e.g. short hash prefix + timestamp
• Nice index access pattern
• Unlimited number of chunks

[Diagram: shard key range A000 ~ C999; each shard’s local B+Tree grows sequentially within its own prefix ranges (A000..., A123..., B000..., B123..., C000..., C123...).]
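A sketch combining both parts (the exact encoding of the sequential part is an assumption; the slide only shows the A000 ~ C999 shape):

```python
import hashlib
import time
from bson import ObjectId

def prefix_plus_sequential_key(doc_id):
    """Low-cardinality hash prefix + timestamp, e.g. '9f_1350000000'.

    The few-valued prefix keeps each shard appending at a handful of
    sequential insertion points, while the timestamp suffix keeps the
    number of distinct keys (and therefore chunks) unbounded."""
    prefix = hashlib.sha1(str(doc_id).encode()).hexdigest()[:2]
    return "%s_%d" % (prefix, int(time.time()))
```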
Finally, 2x throughput
Lessons learned

• Know the performance impact of secondary indexes
• Choose the right shard key
• Test with large data sets
• Linear scalability is hard
  • If you really need it, consider HBase or Cassandra
  • SSD
Thank you. Questions?

유응섭 rspeed@daumcorp.com
최준건 gunn@daumcorp.com
