All-in-One NoSQL with KVS performance, RDBMS-style indexes, and MapReduce: MongoDB

Hiroaki Kubota (窪田 博昭), Rakuten, Inc.
Presentation slides from "Mongo Tokyo 2012"

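This talk introduces MongoDB's characteristics and ways to take advantage of them, using Rakuten's Infoseek News as a case study.
It also shares the results of feature and performance tests, operational know-how, and weaknesses(!), and publishes patches for the PHP driver.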

Published in: Technology

  1. All-in-one NoSQL with KVS performance, RDBMS-style indexes, and MapReduce
     Hiroaki Kubota (窪田 博昭), Architect Group, Development Unit, Rakuten, Inc. | January 18, 2012
  2. Agenda
     • Introduction
     • How to use mongo on news.infoseek.co.jp
  3. Introduction
  4. Who am I?
  5. Profile
     Name: 窪田 博昭 (Hiroaki Kubota)
     Company: Rakuten Inc.
     Unit: ACT = Development Unit Architect Group
     Mail: hiroaki.kubota@mail.rakuten.com
     Hobby: Futsal, Golf
     Recent: My physical strength has gradually declined...
     Twitter: crumbjp
     GitHub: crumbjp
  6. How to take advantage of Mongo for Infoseek News
  7. An example of our pages
  8. Page structure
  9. Layout / Components
  10. Albatross structure (request-flow diagram). Labels: Internet → Request → WEB; SessionDB (MongoDB ReplSet); LayoutDB (MongoDB ReplSet): get page layout, get components; Call APIs → API; Memcache; Retrieve data → ContentsDB (MongoDB ReplSet)
  11. Albatross structure (deployment diagram). Labels: Developer: HTML markup, set page layout & deploy API, API settings; CMS: set components; Batch servers: insert data; LayoutDB (MongoDB ReplSet); API servers; ContentsDB (MongoDB ReplSet)
  12. CMS: layout editor
  13. CMS
  14. CMS
  15. MapReduce
  16. MapReduce: our usage
      We have never used MapReduce in regular operation. However, we have used it for some irregular cases:
      • To search for invalid articles that should be removed because of someone's mistake.
      • To analyze the number of new articles posted per day.
      • To analyze the number of updates per article.
      • We will start considering regular use for social data analysis before long.
  17. Structure & Performance
  18. Structure
      We are using very poor machines (virtual machines)!!
      • Intel(R) Xeon(R) CPU X5650 2.67 GHz, 1 core!!
      • 4 GB memory
      • 50 GB disk space (iSCSI)
      • CentOS 5.5 64-bit
      • MongoDB 1.8.0
        – ReplicaSet, 5 nodes (+ 1 arbiter)
        – Oplog size 1.2 GB
        – Average object size 1 KB
  19. Structure: researched environments
      We have also researched the following environments:
      • Virtual machine, 1 core
        – 1 KB data, 6,000,000 documents
        – 8 KB data, 200,000 documents
      • Virtual machine, 3 cores
        – 1 KB data, 6,000,000 documents
        – 8 KB data, 200,000 documents
      • EC2 large instance
        – 2 KB data, 60,000,000 documents (100 GB)
  20. Performance
      I found a formula for making a rough estimate of QPS, for 1 to 8 KB documents with 1 unique index:
      C = number of CPU cores (Xeon 2.67 GHz)
      DD = score of the 'dd' command (bytes/sec)
      S = document size (bytes)
      • GET qps = 4500 × C
      • SET (fsync) qps = 0.05 × DD ÷ S
      • SET (nsync) qps = 4500, BUT... there is a chance of going STALE
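The rule of thumb on this slide can be written as a small calculator. This is only a sketch: the function name `estimateQps` and the sample inputs (1 core, a `dd` score of 100 MB/s, 1 KB documents) are hypothetical, and it assumes the fsync SET figure is documents per second, since DD ÷ S cancels the byte units.

```javascript
// Rough QPS estimate following the slide's rule of thumb.
// Assumption: the fsync SET result is in documents/sec (bytes/sec ÷ bytes).
function estimateQps({ cores, ddBytesPerSec, docSizeBytes }) {
  const GET_QPS_PER_CORE = 4500; // constant from the slide (Xeon 2.67 GHz)
  const FSYNC_FACTOR = 0.05;     // constant from the slide
  const NOSYNC_SET_QPS = 4500;   // constant from the slide; replicas may go stale
  return {
    get: GET_QPS_PER_CORE * cores,
    setFsync: (FSYNC_FACTOR * ddBytesPerSec) / docSizeBytes,
    setNoSync: NOSYNC_SET_QPS,
  };
}

// Hypothetical inputs, not measurements from the talk.
const estimate = estimateQps({
  cores: 1,
  ddBytesPerSec: 100 * 1024 * 1024,
  docSizeBytes: 1024,
});
console.log(estimate);
```

The formula only claims to hold for 1 to 8 KB documents with one unique index, so inputs outside that range should be treated with extra caution.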
  21. Performance example (on EC2 large)
  22. Performance example (on EC2 large): environment and amount of data
      EC2 large instance
      – 2 KB data, 60,000,000 documents (100 GB)
      – 1 unique index
      Data type: { shop: "someone", item: "something", description: "item explanation sentences..." }
  23. Performance example (on EC2 large)
      Batch insert (in batches of 1000 documents), fsync=true: 17906 sec (= 298 min) (= 3358 docs/sec)
      Ensure index (background=false): 4049 sec (= 67 min)
        1. primary: 2101 sec (= 35 min)
        2. secondary: 1948 sec (= 32 min)
  24. Performance example (on EC2 large)
      Add one node: 5833 sec (= 97 min)
        1. Get files (2 GB × 48): 2120 sec (= 35 min)
        2. _id indexing: 1406 sec (= 23 min)
        3. uniq indexing: 2251 sec (= 38 min)
        4. other processes: 56 sec (= 1 min)
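As a quick check of the "add one node" breakdown, the sketch below sums the four phases reported on the slide and shows each phase's share of the total. The function name `summarize` and the key names are hypothetical; the numbers are the slide's.

```javascript
// Phase timings in seconds, as reported on the slide for adding one node.
const phases = { getFiles: 2120, idIndexing: 1406, uniqIndexing: 2251, other: 56 };

// Sum the phases and compute each phase's rounded percentage of the total.
function summarize(timings) {
  const total = Object.values(timings).reduce((a, b) => a + b, 0);
  const share = {};
  for (const [name, sec] of Object.entries(timings)) {
    share[name] = Math.round((sec / total) * 100);
  }
  return { total, minutes: Math.round(total / 60), share };
}

const summary = summarize(phases);
console.log(summary);
```

The phases add up to the reported 5833 sec, and the two index builds together account for roughly 63% of the time, more than copying the data files.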
  25. Performance example (on EC2 large): group by
      • Narrow down by the unique index, then map and reduce: 368 msec
          db.data.group({
            key: { shop: 1 },
            cond: { shop: "someone" },
            reduce: function (o, p) { p.sum++; },
            initial: { sum: 0 }
          });
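To make the semantics of this group call concrete without a running server, here is an in-memory stand-in. It is a sketch, not the MongoDB implementation: `groupCount` and the sample documents are hypothetical, and it only covers the count-per-key case used on the slide.

```javascript
// In-memory stand-in for the slide's db.data.group() call:
// keep documents matching cond, then count them per value of keyField.
function groupCount(docs, keyField, cond) {
  const groups = new Map();
  for (const doc of docs) {
    const matches = Object.entries(cond).every(([k, v]) => doc[k] === v);
    if (!matches) continue;
    const key = doc[keyField];
    // Mirrors initial: { sum: 0 }
    if (!groups.has(key)) groups.set(key, { [keyField]: key, sum: 0 });
    // Mirrors reduce: function (o, p) { p.sum++; }
    groups.get(key).sum++;
  }
  return [...groups.values()];
}

// Hypothetical sample data in the shape shown on the data-type slide.
const sampleDocs = [
  { shop: "someone", item: "a" },
  { shop: "someone", item: "b" },
  { shop: "other", item: "c" },
];
const grouped = groupCount(sampleDocs, "shop", { shop: "someone" });
console.log(grouped);
```

Because the condition selects a single shop through the unique index, the real query touches only a small slice of the collection, which fits the 368 msec figure.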
  26. Performance example (on EC2 large): MapReduce
      • Scan all data: 3116 sec (= 52 min)
        – number of keys = 39092
          db.data.mapReduce(
            function () { emit(this.shop, 1); },
            function (k, v) {
              var ret = 0;
              v.forEach(function (value) { ret += value; });
              return ret;
            },
            { query: {}, inline: 1, out: "Tmp" }
          );
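The same count-per-shop job can be traced in plain JavaScript. This is a simplified sketch with hypothetical names: unlike the mongo shell, `emit` is passed to the map function as an argument instead of being a global, and reduce is called exactly once per key.

```javascript
// Minimal in-memory map/reduce: map emits (key, value) pairs,
// reduce folds all values emitted for one key into a single value.
function mapReduce(docs, map, reduce) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  for (const doc of docs) map.call(doc, emit); // `this` is the document, as in the shell
  const result = {};
  for (const [key, values] of emitted) result[key] = reduce(key, values);
  return result;
}

// Hypothetical sample documents.
const shopDocs = [{ shop: "x" }, { shop: "y" }, { shop: "x" }];
const counts = mapReduce(
  shopDocs,
  function (emit) { emit(this.shop, 1); },
  function (k, v) {
    let ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  }
);
console.log(counts);
```

A summing reducer like the one on the slide also stays correct if the engine feeds partial results back into reduce, which is why it is a safe choice for a full-collection scan.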
  27. Major problems...
  28. Indexing
  29. Index problem
      Online indexing is completely useless, even in the latest version (2.0.2).
      Indexing is a locking operation by default.
      An indexing operation can run in the background on the primary. But...
      it CANNOT run in the background on the secondaries.
      Moreover, all secondaries run their indexing at the same time!!
      As a result, all slaves freeze! orz...
  30. Present indexing (default)
  31. Index problem: present indexing (default). Diagram: the batch saves to the primary; five clients read from three secondaries.
  32. Index problem: present indexing (default). Diagram: ensureIndex runs on the primary, which locks while indexing; the batch cannot write.
  33. Index problem: present indexing (default). Diagram: the primary has finished (complete); on sync, all three secondaries lock and index at once; clients cannot read!!
  34. Index problem: ideal indexing (default). Diagram: the primary and all three secondaries are complete.
  35. Present indexing (background)
  36. Index problem: present indexing (background). Diagram: the batch saves to the primary; five clients read from three secondaries.
  37. Index problem: present indexing (background). Diagram: ensureIndex(background) runs on the primary; the primary indexes and slows down, and the batch slows down.
  38. Index problem: present indexing (background). Diagram: the primary has finished (complete); on sync, all three secondaries lock and index at once; clients cannot read!!
  39. Index problem: present indexing (background). Same diagram, with the callout: background indexing doesn't work on the secondaries.
  40. Index problem: present indexing (background). Same diagram as slide 38: all three secondaries lock and index; clients cannot read!!
  41. Index problem: ideal indexing (background). Diagram: the primary and all three secondaries are complete.
  42. Probable 2.1.X indexing
  43. Index problem
      According to mongodb.org, this problem will be fixed in 2.1.0, but that version has not been formally released.
      So I checked out the latest source code. It certainly will be fixed!
      Moreover, it looks like indexing will run in the foreground when the slave's status isn't SECONDARY (that is, when it is RECOVERING).
  44. Index problem: probable 2.1.X indexing. Diagram: the batch saves to the primary; five clients read from three secondaries.
  45. Index problem: probable 2.1.X indexing. Diagram: ensureIndex(background) runs on the primary; the primary indexes and slows down, and the batch slows down.
  46. Index problem: probable 2.1.X indexing. Diagram: the primary has finished (complete); on sync, all three secondaries index and slow down at once; clients slow down.
  47. Index problem: probable 2.1.X indexing. Diagram: the primary and all three secondaries are complete.
  48. Index problem: background indexing in 2.1.X
      But I think it's not enough.
      I think it can be fatal for the system that all secondaries slow down at the same time!! So...
  49. Ideal indexing
  50. Index problem: ideal indexing. Diagram: the batch saves to the primary; five clients read from three secondaries.
  51. Index problem: ideal indexing. Diagram: ensureIndex(background) runs on the primary; the primary indexes and slows down, and the batch slows down.
  52. Index problem: ideal indexing. Diagram: the primary is complete; ensureIndex reaches the first secondary, which goes to Recovering and indexes; the other two secondaries keep serving clients.
  53. Index problem: ideal indexing. Diagram: the first secondary is complete; the second secondary goes to Recovering and indexes.
  54. Index problem: ideal indexing. Diagram: the first two secondaries are complete; the third secondary goes to Recovering and indexes.
  55. Index problem: ideal indexing. Diagram: the primary and all three secondaries are complete.
  56. Index problem
      But... I can easily guess that this is difficult to do with the current oplog.
      It would be great if I could run indexing manually on each secondary.
  57. I suggest manual indexing
  58. Index problem: manual indexing. Diagram: the batch saves to the primary; five clients read from three secondaries.
  59. Index problem: manual indexing. Diagram: ensureIndex(manual,background) runs on the primary; the primary indexes and slows down, and the batch slows down.
  60. Index problem: manual indexing. Diagram: the primary has finished (complete); the three secondaries are unchanged.
  61. Index problem: manual indexing. Same diagram, with the callout: the secondaries don't sync automatically.
  62. Index problem: manual indexing. Same diagram as slide 60.
  63. Index problem: manual indexing. Diagram: ensureIndex(manual) runs on the first secondary, which goes to Recovering and indexes; the other two secondaries keep serving clients.
  64. Index problem: manual indexing. Diagram: the first secondary is complete; ensureIndex(manual) runs on the second secondary, which goes to Recovering and indexes.
  65. Index problem: manual indexing. Diagram: the first two secondaries are complete; ensureIndex(manual,background) runs on the third secondary, which indexes and slows down.
  66. Index problem: manual indexing. Same diagram, with the callout: it needs to support background operation, just in case the ReplSet has only one secondary.
  67. Index problem: manual indexing. Same diagram as slide 65.
  68. Index problem: manual indexing. Diagram: the primary and all three secondaries are complete.
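The order of operations in the manual-indexing proposal can be sketched as a plan generator. Everything here is hypothetical: `manualIndexPlan` is an illustrative helper, and the `manual` option of ensureIndex is the speaker's proposal, not an existing MongoDB feature. The sketch follows one reading of slides 59 to 67: the primary builds in the background, each secondary then builds in turn in the foreground, and the last secondary builds in the background.

```javascript
// Build the per-node indexing order proposed in the slides.
// mode "background": node stays available but slows down.
// mode "foreground": node is unavailable (Recovering) while it indexes.
function manualIndexPlan(primary, secondaries) {
  const steps = [{ node: primary, mode: "background" }];
  secondaries.forEach((node, i) => {
    const isLast = i === secondaries.length - 1;
    // The last secondary (or the only one) must stay readable,
    // so it builds in the background, as slide 66 calls out.
    steps.push({ node, mode: isLast ? "background" : "foreground" });
  });
  return steps;
}

// Hypothetical node names.
const indexPlan = manualIndexPlan("primary", ["s1", "s2", "s3"]);
console.log(indexPlan);
```

The point of the design is that at most one secondary is out of service at any moment, instead of all of them locking at once as in the present behavior.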
  69. That's all about the indexing problem
  70. Struggle to control the sync
  71. STALE
  72. Unknown log & ReplSet out of control
      We often suffered from secondaries going out of control...
      • Secondaries repeatedly flip between Secondary and Recovering within moments (1.8.0)
      • Then we found a strange line in the log...
        [rsSync] replSet error RS102 too stale to catch up
  73. What's stale?
      stale [stéil] (level: essential for working adults), powered by goo.ne.jp
      • (of food or drink) not fresh (⇔ fresh);
      • flat; (of coffee) having lost its aroma;
      • (of bread) dried out, hardened;
      • (of air or smells) stuffy;
      • foul-smelling
  74. What's stale?
      Same dictionary entry as slide 73, with the comment: apparently this is something very bad...
  75. Mechanism of becoming stale
  76. ReplicaSet. Diagram: a client, and two mongod processes (Primary and Secondary), each with a Database and an Oplog.
  77. Replication (simple case)
  78. ReplicaSet. Same diagram as slide 76.
  79. Insert & Replication 1. Diagram: the client inserts A into the primary; the primary's database holds A and its oplog holds [Insert A].
  80. Insert & Replication 1. Diagram: sync; the secondary copies [Insert A] into its oplog and applies it, so both databases hold A.
  81. Replication (busy case)
  82. Stale. Diagram: starting state; both primary and secondary hold A in the database and [Insert A] in the oplog.
  83. Insert & Replication 2. Diagram: the client inserts B; the primary's oplog is [Insert A, Insert B]; the secondary is unchanged.
  84. Insert & Replication 2. Diagram: the client inserts C; the primary's oplog is [Insert A, Insert B, Insert C].
  85. Insert & Replication 2. Diagram: the client updates A; the primary's oplog is [Insert A, Insert B, Insert C, Update A].
  86. Insert & Replication 2. Diagram: the secondary checks the primary's oplog.
  87. Insert & Replication 2. Diagram: sync; the secondary copies and applies [Insert B, Insert C, Update A], so both nodes hold A, B, C.
  88. Replication (more busy)
  89. Stale. Diagram: starting state; both primary and secondary hold A in the database and [Insert A] in the oplog.
  90. Stale. Diagram: the client inserts B; the primary's oplog is [Insert A, Insert B]; the secondary is unchanged.
  91. Stale. Diagram: the client inserts C; the primary's oplog is [Insert A, Insert B, Insert C].
  92. Stale. Diagram: the client updates A; the primary's oplog is [Insert A, Insert B, Insert C, Update A].
  93. Stale. Diagram: the client updates C; the primary's oplog is [Insert A, Insert B, Insert C, Update A, Update C] and is now full.
  94. Stale. Diagram: the client inserts D; [Insert A] is pushed out of the primary's oplog; the secondary's oplog still holds only [Insert A].
  95. Stale. Diagram: the secondary checks the primary's oplog: [Insert A] not found!!
  96. Stale. Diagram: [Insert A] not found!! The secondary cannot get information about [Insert B], so it cannot sync!! This is called STALE. The secondary's status becomes Recovering.
  97. Stale
      We have to understand the importance of adjusting the oplog size.
      We can specify the oplog size as a command-line option, but only the first time mongod starts on a given dbpath (which is also specified on the command line).
      Also, we cannot change the oplog size without clearing the dbpath. Be careful!
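The stale mechanism in the diagrams above can be reproduced with a toy capped log. This is a sketch under simplifying assumptions: the `Oplog` class is hypothetical, the capacity is counted in entries (a real oplog is capped by bytes), and a secondary is treated as stale as soon as its last applied entry is missing from the primary's log.

```javascript
// Toy capped oplog: when full, the oldest entry is overwritten.
class Oplog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
    this.nextId = 1;
  }
  append(op) {
    const entry = { id: this.nextId++, op };
    this.entries.push(entry);
    if (this.entries.length > this.capacity) this.entries.shift();
    return entry.id;
  }
  // Entries a secondary still has to apply, or null when its last
  // applied entry is no longer in the log (the secondary is stale).
  entriesAfter(lastAppliedId) {
    const index = this.entries.findIndex((e) => e.id === lastAppliedId);
    return index === -1 ? null : this.entries.slice(index + 1);
  }
}

const primaryOplog = new Oplog(4);
const lastApplied = primaryOplog.append("Insert A"); // the secondary synced up to here

// Busy case: the secondary can still find its position and catch up.
["Insert B", "Insert C", "Update A"].forEach((op) => primaryOplog.append(op));
const busyCase = primaryOplog.entriesAfter(lastApplied);

// More busy: two more writes push [Insert A] out before the next sync.
["Update C", "Insert D"].forEach((op) => primaryOplog.append(op));
const moreBusyCase = primaryOplog.entriesAfter(lastApplied);

console.log(busyCase.map((e) => e.op), moreBusyCase === null ? "STALE" : "ok");
```

The only knob that changes the outcome is the capacity relative to the write rate between syncs, which is why the oplog size has to be chosen carefully up front.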
  98. Replication (joining as a new node)
  99. InitialSync. Diagram: a primary alone; its database holds A, B, C, D and its oplog holds [Insert C, Update A, Update C, Insert D].
  100. InitialSync. Diagram: a new mongod with an empty database and oplog joins in the Startup state.
  101. InitialSync. Diagram: get last oplog; the new node (Recovering) records the primary's latest entry, [Insert D].
  102. InitialSync. Diagram: cloning DB; the new node starts copying D, C, B, A from the primary.
  103. InitialSync. Diagram: cloning DB continues; A has arrived in the new node's database.
  104. InitialSync. Diagram: during cloning, the client inserts E; the primary's oplog gains [Insert E].
  105. InitialSync. Diagram: the client updates B; the primary's oplog gains [Update B]; cloning DB is complete.
  106. InitialSync. Diagram: the new node checks the primary's oplog, starting from [Insert D].
  107. InitialSync. Diagram: sync; the new node applies [Insert E, Update B] and its status becomes Secondary.
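The initial-sync sequence only works if the entry recorded at "get last oplog" is still in the primary's oplog when cloning finishes. A back-of-the-envelope check can be sketched as follows. The function names are hypothetical, the write rates are made-up inputs, and the sketch assumes each oplog entry is about as large as the document it writes. It also mixes figures from different slides (the 1.2 GB oplog of the production VMs and the 5833 sec node-add time measured on EC2) purely for illustration.

```javascript
// How many seconds of writes the oplog can hold at a given write rate.
function oplogWindowSeconds(oplogBytes, writeBytesPerSec) {
  return oplogBytes / writeBytesPerSec;
}

// Initial sync can finish only if the oplog window outlasts the sync itself.
function canSurviveInitialSync(oplogBytes, writeBytesPerSec, syncSeconds) {
  return oplogWindowSeconds(oplogBytes, writeBytesPerSec) > syncSeconds;
}

const OPLOG_BYTES = 1.2 * 1024 ** 3; // 1.2 GB oplog (slide 18)
const SYNC_SECONDS = 5833;           // time to add one node (slide 24)
const DOC_BYTES = 1024;              // 1 KB average object (slide 18)

// Hypothetical write rates: 500 docs/sec and 100 docs/sec.
const heavy = canSurviveInitialSync(OPLOG_BYTES, 500 * DOC_BYTES, SYNC_SECONDS);
const light = canSurviveInitialSync(OPLOG_BYTES, 100 * DOC_BYTES, SYNC_SECONDS);
console.log({ heavy, light });
```

Under these assumptions the window is about 2,500 sec at 500 docs/sec, shorter than the sync, and about 12,600 sec at 100 docs/sec, which is long enough.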
  108. Additional information
       From the source code (I have never tested these...):
       A secondary will try to sync from other secondaries when it cannot reach the primary or might be stale against the primary.
       There is a small chance that the sync problem does not occur, if another secondary has an older oplog or a larger oplog space than the primary.
  109. Sync from another secondary. Diagram: the primary's oplog is [Insert B, Insert C, Update A, Update C, Insert D], with [Insert A] pushed out; one secondary's oplog holds only [Insert A]; another secondary still holds [Insert A] through [Insert D].
  110. Sync from another secondary. Diagram: the lagging secondary checks the primary's oplog: [Insert A] not found!!
  111. Sync from another secondary. Diagram: but [Insert A] is found at the other secondary, so the lagging secondary is able to sync.
  112. Sync from another secondary. Diagram: sync; the lagging secondary copies the entries from the other secondary, and all three nodes hold A, B, C, D.
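The fallback shown in these diagrams can be sketched as a sync-source search. This is an illustration of the idea only, not MongoDB's actual selection logic (which the speaker also notes he has not tested): `pickSyncSource`, the member names, and the numeric entry ids are all hypothetical.

```javascript
// Pick the first member whose oplog still contains the entry this
// secondary applied last; candidates are listed in preference order.
function pickSyncSource(lastAppliedId, candidates) {
  for (const candidate of candidates) {
    if (candidate.oplogIds.includes(lastAppliedId)) return candidate.name;
  }
  return null; // stale against every member: a full resync is needed
}

// Entry ids: 1 = Insert A, 2 = Insert B, ..., 6 = Insert D.
const members = [
  { name: "primary", oplogIds: [2, 3, 4, 5, 6] },       // Insert A already pushed out
  { name: "secondary2", oplogIds: [1, 2, 3, 4, 5, 6] }, // larger oplog keeps Insert A
];
const source = pickSyncSource(1, members);
console.log(source);
```

This is why a secondary with a larger oplog than the primary can rescue a lagging member: it keeps history that the primary has already overwritten.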
  113. That's all about sync
  114. Others...
  115. Disk space
  116. Disk space: data fragments sparsely across the DB files...
       We ran into an unfavorable situation in our DBs.
       It appeared in some of our collections about 3 months after we launched the services.
       db.ourcol.storageSize() = 16200727264 (15 GB)
       db.ourcol.totalSize() = 16200809184
       db.ourcol.totalIndexSize() = 81920
       db.ourcol.dataSize() = 2032300 (2 MB)
       What happened to them!!
  117. Disk space: data fragments sparsely across the DB files...
       It seems to be caused by a specific pattern of operations: inserting, updating, and deleting over and over.
       Anyway, we have to shrink the used disk space regularly, just like PostgreSQL's VACUUM. But how?
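The four numbers on slide 116 are enough to quantify the fragmentation. The sketch below uses the slide's values; the helper name `fragmentation` and the field names of its result are hypothetical.

```javascript
// Collection statistics as reported on slide 116 (bytes).
const collStats = {
  storageSize: 16200727264,
  totalIndexSize: 81920,
  totalSize: 16200809184,
  dataSize: 2032300,
};

// Compare allocated storage with live data.
function fragmentation(stats) {
  return {
    // totalSize should equal storage plus indexes.
    totalIsConsistent: stats.storageSize + stats.totalIndexSize === stats.totalSize,
    // Bytes of allocated storage per byte of live data.
    storagePerDataByte: stats.storageSize / stats.dataSize,
    // Fraction of the allocated storage that holds live data.
    liveFraction: stats.dataSize / stats.storageSize,
  };
}

const frag = fragmentation(collStats);
console.log(frag);
```

Roughly 8,000 bytes are allocated for every byte of live data, so well under 0.1% of the collection's storage is in use. A periodic check of this ratio would flag collections that need the shrinking procedure.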
  118. Disk space: shrinking the used disk space
       MongoDB offers some functions for this case, but we couldn't use them in our case!
       repairDatabase: only runnable on the primary. It takes a long time and BLOCKS all operations!!
       compact: only runnable on a secondary. It zero-fills the blank space instead of shrinking the disk space, so it cannot shrink...
  119. Disk space: our countermeasures
       For temporary collections: issue the drop command regularly.
       For other collections:
         1. Remove one secondary from the ReplSet.
         2. Shut it down.
         3. Remove all of its DB files.
         4. Rejoin it to the ReplSet.
         5. Repeat these operations on each secondary, one after another.
         6. Step down the primary (change the primary node).
         7. Finally, do steps 1 to 4 on the former primary.
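The rolling procedure for non-temporary collections can be written down as an ordered checklist generator. This is a sketch: `rollingResyncPlan` and the node names are hypothetical, and the strings describe the manual steps from the slide, they are not MongoDB commands.

```javascript
// Produce the ordered checklist from slide 119: resync every secondary
// one at a time, then step down the primary and resync it last.
function rollingResyncPlan(primary, secondaries) {
  const steps = [];
  const resync = (node) => {
    steps.push(
      `remove ${node} from the ReplSet`,
      `shut down ${node}`,
      `delete all DB files on ${node}`,
      `rejoin ${node} to the ReplSet`
    );
  };
  secondaries.forEach((node) => resync(node)); // steps 1 to 5
  steps.push(`step down ${primary}`);          // step 6
  resync(primary);                             // step 7
  return steps;
}

// Hypothetical node names.
const resyncPlan = rollingResyncPlan("node-a", ["node-b", "node-c"]);
resyncPlan.forEach((step, i) => console.log(`${i + 1}. ${step}`));
```

Each rejoin triggers a full initial sync, so the oplog has to be large enough to cover the sync time before the next node is taken out.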
  120. PHP client
  121. PHP client: we tried 1.4.4 and 1.2.2
       1.4.4: There are some critical bugs around the connection pool. We struggled to invalidate broken connections. I think you should use 1.2.X instead of 1.4.X.
       1.2.2: The connection pool problems seem to be fixed. But there are 2 critical bugs!
         – Socket handle leak
         – Useless sleep
       However, this version is relatively stable as long as these bugs are fixed.
  122. PHP client: we tried 1.4.4 and 1.2.2
       https://github.com/crumbjp/Personal
         - mongo1.2.2.non-wait.patch
         - mongo1.2.2.sock-leak.patch
  123. PHP client
  124. Closing
  125. Closing: what's MongoDB?
       It has very good READ performance. We can use mongo instead of memcached, if we can accept the limited write performance.
       Die hard! MongoDB keeps high availability even under severe stress.
       It can be used easily without deep consideration. We can manage to do anything once we have started using it.
       Let's forget the awkward, trivial things that have bothered us: How to treat huge data? How to fit in a cache system? How to keep availability? And so on...
  126. Closing: keep in mind
       Sharding is challenging... It's a last resort! It's hard to operate, in particular maintaining the config servers. [mongos] is also difficult to keep alive. I want a way to fail over mongos.
       Mongo is able to run in a poor environment, but... the one thing you should set aside is plenty of disk space.
       Huge writes are sensitive: adjust the oplog size carefully.
       The indexing function is unfinished: indexes cannot be applied online.
  127. All right, have fun!!
  128. Thank you for listening
