MongoTokyo


Mongo Tokyo 2012 session

Published in: Technology

  1. 1. KVS performance, RDBMS-style indexes, plus MapReduce: an all-in-one NoSQL. Rakuten, Inc. DU Architect Group, Hiroaki Kubota | 2011/1/18 1
  2. 2. Introduction 2
  3. 3. Who am I ? 3
  4. 4. Introduction: Profile. Name: 窪田 博昭 (Hiroaki Kubota). Company: Rakuten Inc. Unit: ACT = Development Unit Architect Group. Mail: hiroaki.kubota@mail.rakuten.com. Hobby: Futsal, Golf. Recent: My physical strength has gradually declined... Twitter: crumbjp. GitHub: crumbjp 4
  5. 5. Introduction: Agenda • Introduction • Mongo’s characteristics • How to take advantage of Mongo for our service – Our new system “cockatoo” – MapReduce • Structure & Performance • Performance example (on EC2 large) • Major problems... – Indexing – STALE – Disk space – PHP client • Closing 5
  6. 6. Mongo’s characteristics 6
  7. 7. Mongo’s characteristics: READ performance is extremely good! WRITE performance is so-so, and it does not scale. Reading data immediately after it is written performs badly. Very high availability! Still under development: maintenance tools are poor, and some operations are useless. 7
  8. 8. How to take advantage of Mongo for Infoseek News 8
  9. 9. Our new system “Cockatoo” (formerly called “Albatross”) 9
  10. 10. An example of our pages 10
  11. 11. Page structure 11
  12. 12. Layout / Components (diagram labels: Layout, Components) 12
  13. 13. Generic WEB structure (diagram): Internet → Request → WEB servers → Call APIs → API servers → Retrieve data → DB 13
  14. 14. Cockatoo structure (diagram): Internet → Request → WEB servers → Call APIs → API servers → Retrieve data → ContentsDB (MongoDB ReplSet). Other elements: SessionDB (MongoDB ReplSet); LayoutDB (MongoDB ReplSet; Get page layout, Get components); Memcache 14
  15. 15. Cockatoo structure (same diagram as slide 14), with the note: Mongo’s READ performance is enough to cope with the WEB page views, but its WRITE performance is not enough. 15
  16. 16. Cockatoo structure (same diagram as slide 14) 16
  17. 17. Cockatoo structure (same diagram as slide 14, with Zookeeper added next to the API servers) 17
  18. 18. Cockatoo structure (same diagram as slide 17, with Solr added) 18
  19. 19. Cockatoo structure, content management side (diagram labels): Developer; HTML markup; Set page layout & Deploy API; API settings; CMS; Batch servers; Set components; Insert data; API servers; Set static contents; LayoutDB (MongoDB ReplSet); ContentsDB (MongoDB ReplSet) 19
  20. 20. CMS: Layout editor 20
  21. 21. CMS 21
  22. 22. CMS 22
  23. 23. MapReduce 23
  24. 24. MapReduce: Our usage. We have never used MapReduce in regular operation. However, we have used it in some irregular cases: • To search for invalid articles that had to be removed because of someone’s mistake... • To analyze the number of new articles posted per day. • To analyze the number of updates per article. • We will soon start considering using it regularly for social data analysis... 24
  25. 25. Structure & Performance 25
  26. 26. Structure: We are using very poor machines (virtual machines)!! • Intel(R) Xeon(R) CPU X5650 2.67GHz, 1 core!! • 4 GB memory • 50 GB disk space (iSCSI) • CentOS 5.5 64-bit • MongoDB 1.8.0 – ReplicaSet, 5 nodes (+ 1 Arbiter) – Oplog size 1.2 GB – Average object size 1 KB 26
  27. 27. Structure: Researched environments. We’ve also researched the following environments... • Virtual machine, 1 core – 1 KB data, 6,000,000 documents – 8 KB data, 200,000 documents • Virtual machine, 3 cores – 1 KB data, 6,000,000 documents – 8 KB data, 200,000 documents • EC2 large instance – 2 KB data, 60,000,000 documents (100 GB) 27
  28. 28. Performance: I found a formula for making a rough estimate of QPS (1~8 KB documents + 1 unique index). C = number of CPU cores (Xeon 2.67 GHz); DD = score of the ‘dd’ command (bytes/sec); S = document size (bytes). • GET qps = 4500 × C • SET (fsync) qps = 0.05 × DD ÷ S • SET (nsync) qps = 4500, BUT... there is a chance of STALE 28
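The rule of thumb on this slide can be written as a small function. This is only a sketch: the function name and the input values are made up for illustration, and "nsync" is read here as "without fsync". Dimensionally, 0.05 × DD ÷ S is a rate in documents per second, so the sketch treats the fsync figure that way.

```javascript
// Rough QPS estimate following the slide's rule of thumb.
// cores:         number of CPU cores (the slide assumes Xeon 2.67 GHz)
// ddBytesPerSec: throughput reported by the `dd` command, in bytes/sec
// docSizeBytes:  document size in bytes (the slide covers roughly 1-8 KB)
function estimateQps(cores, ddBytesPerSec, docSizeBytes) {
  return {
    get: 4500 * cores,
    setFsync: (0.05 * ddBytesPerSec) / docSizeBytes,
    setNoSync: 4500, // the slide warns that this mode risks STALE secondaries
  };
}

// Hypothetical inputs: 1 core, dd at 100 MiB/s, 1 KiB documents.
const estimate = estimateQps(1, 100 * 1024 * 1024, 1024);
console.log(estimate);
```

With these assumed inputs the fsync write estimate comes out at about 5,120 documents per second, close to the GET figure for one core; a slower disk lowers only the fsync line.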
  29. 29. Performance example (on EC2 large) 29
  30. 30. Performance example (on EC2 large): Environment and amount of data. EC2 large instance – 2 KB data, 60,000,000 documents (100 GB) – 1 unique index. Data type: { shop: 'someone', item: 'something', description: 'item explanation sentences...' } 30
  31. 31. Performance example (on EC2 large): Batch insert (1000 documents), fsync=true: 17906 sec (=298 min) (=3358 docs/sec). Ensure index (background=false): 4049 sec (=67 min) 1. primary 2101 sec (=35 min) 2. secondary 1948 sec (=32 min) 31
  32. 32. Performance example (on EC2 large): Add one node: 5833 sec (=97 min) 1. Get files, 2 GB × 48: 2120 sec (=35 min) 2. _id indexing: 1406 sec (=23 min) 3. unique indexing: 2251 sec (=38 min) 4. other processes: 56 sec (=1 min) 32
  33. 33. Performance example (on EC2 large): Group by • Reduce by unique index & map & reduce – 368 msec: db.data.group({ key: { shop: 1 }, cond: { shop: 'someone' }, reduce: function (o, p) { p.sum++; }, initial: { sum: 0 } }); 33
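To make clear what the group() call on this slide computes, here is an in-memory emulation in plain JavaScript. It is a sketch, not MongoDB's implementation: the `group` helper and the sample documents are hypothetical, and `cond` is limited to simple equality matches.

```javascript
// In-memory emulation of the semantics of group({ key, cond, reduce, initial }).
// `docs` stands in for the collection.
function group(docs, { key, cond, reduce, initial }) {
  const keyFields = Object.keys(key);
  const buckets = new Map();
  for (const doc of docs) {
    // cond: keep only documents whose fields equal the given values
    const matches = Object.entries(cond).every(([f, v]) => doc[f] === v);
    if (!matches) continue;
    const bucketKey = JSON.stringify(keyFields.map((f) => doc[f]));
    if (!buckets.has(bucketKey)) {
      // each bucket starts from the key fields plus a copy of `initial`
      const acc = {};
      for (const f of keyFields) acc[f] = doc[f];
      buckets.set(bucketKey, Object.assign(acc, initial));
    }
    // reduce mutates the bucket's accumulator, as on the slide
    reduce(doc, buckets.get(bucketKey));
  }
  return [...buckets.values()];
}

// Made-up sample data.
const docs = [
  { shop: 'someone', item: 'a' },
  { shop: 'someone', item: 'b' },
  { shop: 'other', item: 'c' },
];

const result = group(docs, {
  key: { shop: 1 },
  cond: { shop: 'someone' },
  reduce: function (o, p) { p.sum++; },
  initial: { sum: 0 },
});
console.log(result);
```

The call counts the documents of one shop. On the real collection the `cond` on the indexed `shop` field is what keeps the operation fast, since only the matching documents are visited.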
  34. 34. Performance example (on EC2 large): MapReduce • Scan all data: 3116 sec (=52 min) – number of keys = 39092: db.data.mapReduce( function () { emit(this.shop, 1); }, function (k, v) { var ret = 0; v.forEach(function (value) { ret += value; }); return ret; }, { query: {}, inline: 1, out: 'Tmp' } ); 34
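The map and reduce functions on this slide count documents per shop. The following plain JavaScript sketch runs the same two functions against an in-memory array; the `mapReduce` helper, the module-level `emit` stand-in, and the sample documents are assumptions made for illustration only.

```javascript
// `emit` is provided by MongoDB inside map functions; this variable mimics that.
let emit;

function mapReduce(docs, map, reduce) {
  const valuesByKey = new Map();
  emit = (key, value) => {
    if (!valuesByKey.has(key)) valuesByKey.set(key, []);
    valuesByKey.get(key).push(value);
  };
  for (const doc of docs) map.call(doc); // `this` is the current document
  emit = undefined;
  return [...valuesByKey].map(([key, values]) => ({
    _id: key,
    // MongoDB does not call reduce for a key that emitted a single value
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

// Made-up sample data.
const sampleDocs = [
  { shop: 'shopA', item: 'x' },
  { shop: 'shopB', item: 'y' },
  { shop: 'shopA', item: 'z' },
];

const counts = mapReduce(
  sampleDocs,
  function () { emit(this.shop, 1); },
  function (k, v) {
    var ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  }
);
console.log(counts);
```

Unlike the group() example, nothing here narrows the input, which matches the slide's point that this job scans all the data.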
  35. 35. Major problems... 35
  36. 36. Indexing 36
  37. 37. Index problem: Online indexing is completely useless, even in the latest version (2.0.2). Indexing is a locking operation by default. The indexing operation can run in the background on the primary. But... it CANNOT run in the background on the secondaries. Moreover, all the secondaries run their indexing at the same time!! As a result... all the slaves freeze! orz... 37
  38. 38. Present indexing ( default ) 38
  39. 39. Index problem: Present indexing (default). Primary: the Batch saves data. Three Secondaries serve the Clients 39
  40. 40. Index problem: Present indexing (default). Primary: ensureIndex → Lock, Indexing; the Batch cannot write. Three Secondaries serve the Clients 40
  41. 41. Index problem: Present indexing (default). Primary: finished, Complete. SYNC to the three Secondaries: each one Lock, Indexing. The Clients cannot read !! 41
  42. 42. Index problem: Present indexing (default). Primary: Complete. Three Secondaries: Complete 42
  43. 43. Present indexing ( background ) 43
  44. 44. Index problem: Present indexing (background). Primary: the Batch saves data. Three Secondaries serve the Clients 44
  45. 45. Index problem: Present indexing (background). Primary: ensureIndex(background) → Indexing, Slow down...; the Batch slows down. Three Secondaries serve the Clients 45
  46. 46. Index problem: Present indexing (background). Primary: finished, Complete. SYNC to the three Secondaries: each one Lock, Indexing. The Clients cannot read !! 46
  47. 47. Index problem: Present indexing (background). Primary: finished, Complete. SYNC to the three Secondaries: each one Lock, Indexing. Background indexing doesn’t work on the secondaries. The Clients cannot read !! 47
  48. 48. Index problem: Present indexing (background). Primary: finished, Complete. SYNC to the three Secondaries: each one Lock, Indexing. The Clients cannot read !! 48
  49. 49. Index problem: Present indexing (background). Primary: Complete. Three Secondaries: Complete 49
  50. 50. Probable 2.1.X indexing 50
  51. 51. Index problem: According to mongodb.org, this problem will be fixed in 2.1.0. But it has not been formally released. So I checked out the latest source code. It certainly will be fixed! Moreover, it sounds like indexing will run in the foreground when the slave status isn’t SECONDARY (does that mean RECOVERING?) 51
  52. 52. Index problem: Probable 2.1.X indexing. Primary: the Batch saves data. Three Secondaries serve the Clients 52
  53. 53. Index problem: Probable 2.1.X indexing. Primary: ensureIndex(background) → Indexing, Slow down...; the Batch slows down. Three Secondaries serve the Clients 53
  54. 54. Index problem: Probable 2.1.X indexing. Primary: finished, Complete. SYNC to the three Secondaries: each one Slowdown, Indexing. The Clients slow down... 54
  55. 55. Index problem: Probable 2.1.X indexing. Primary: Complete. Three Secondaries: Complete 55
  56. 56. Index problem: Background indexing in 2.1.X. But I think it’s not enough. I think it can bring failure to our system when all the secondaries slow down at the same time!! So... 56
  57. 57. Ideal indexing 57
  58. 58. Index problem: Ideal indexing. Primary: the Batch saves data. Three Secondaries serve the Clients 58
  59. 59. Index problem: Ideal indexing. Primary: ensureIndex(background) → Indexing, Slow down...; the Batch slows down. Three Secondaries serve the Clients 59
  60. 60. Index problem: Ideal indexing. Primary: finished, Complete. ensureIndex on the first Secondary: Recovering, Indexing. The other two Secondaries serve the Clients 60
  61. 61. Index problem: Ideal indexing. Primary: Complete. First Secondary: Complete. ensureIndex on the second Secondary: Recovering, Indexing. The third Secondary serves the Clients 61
  62. 62. Index problem: Ideal indexing. Primary: Complete. First and second Secondaries: Complete. ensureIndex on the third Secondary: Recovering, Indexing 62
  63. 63. Index problem: Ideal indexing. Primary: Complete. Three Secondaries: Complete 63
  64. 64. Index problem: But... I can easily guess that this is difficult to apply with the current Oplog. It would be great if I could run the indexing manually on each secondary. 64
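The difference between today's behavior and the "ideal" one-secondary-at-a-time build can be shown with a toy timeline. This sketch is a model of the slides only, not of MongoDB internals: the function names and the state labels are hypothetical, and a node in the 'Indexing' state is assumed to be unable to serve reads.

```javascript
// Build a timeline of secondary states for an index rollout.
// 'simultaneous': every secondary builds the index at once (the present behavior on the slides)
// 'rolling':      secondaries build the index one at a time (the "ideal" behavior)
function indexTimeline(strategy, secondaryCount) {
  const steps = [];
  if (strategy === 'simultaneous') {
    steps.push(Array(secondaryCount).fill('Indexing'));
  } else if (strategy === 'rolling') {
    for (let i = 0; i < secondaryCount; i++) {
      steps.push(
        Array.from({ length: secondaryCount }, (_, j) =>
          j < i ? 'Complete' : j === i ? 'Indexing' : 'Secondary')
      );
    }
  } else {
    throw new Error(`unknown strategy: ${strategy}`);
  }
  steps.push(Array(secondaryCount).fill('Complete'));
  return steps;
}

// Smallest number of secondaries able to serve reads at any step.
function minReadable(steps) {
  return Math.min(
    ...steps.map((states) => states.filter((s) => s !== 'Indexing').length)
  );
}

console.log(indexTimeline('rolling', 3));
console.log(minReadable(indexTimeline('simultaneous', 3)));
console.log(minReadable(indexTimeline('rolling', 3)));
```

In this model the simultaneous build leaves no readable secondary at its worst step, while the rolling build always keeps all but one available, at the cost of a longer total rollout.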
  65. 65. I suggest Manual indexing 65
  66. 66. Index problem: Manual indexing. Primary: the Batch saves data. Three Secondaries serve the Clients 66
  67. 67. Index problem: Manual indexing. Primary: ensureIndex(manual, background) → Indexing, Slow down...; the Batch slows down. Three Secondaries serve the Clients 67
  68. 68. Index problem: Manual indexing. Primary: finished, Complete. Three Secondaries serve the Clients 68
  69. 69. Index problem: Manual indexing. Primary: finished, Complete. Three Secondaries serve the Clients. The secondaries don’t sync automatically 69
  70. 70. Index problem: Manual indexing. Primary: finished, Complete. Three Secondaries serve the Clients 70
  71. 71. Index problem: Manual indexing. Primary: Complete. ensureIndex(manual) on the first Secondary: Recovering, Indexing. The other two Secondaries serve the Clients 71
  72. 72. Index problem: Manual indexing. Primary: Complete. First Secondary: Complete. ensureIndex(manual) on the second Secondary: Recovering, Indexing 72
  73. 73. Index problem: Manual indexing. Primary: Complete. Two Secondaries: Complete. ensureIndex(manual, background) on the remaining Secondary: Slowdown, Indexing 73
  74. 74. Index problem: Manual indexing. Primary: Complete. Two Secondaries: Complete. ensureIndex(manual, background) on the remaining Secondary: Slowdown, Indexing. It needs to support background operation, just in case the ReplSet has only one Secondary 74
  75. 75. Index problem: Manual indexing. Primary: Complete. Two Secondaries: Complete. ensureIndex(manual, background) on the remaining Secondary: Slowdown, Indexing 75
  76. 76. Index problem: Manual indexing. Primary: Complete. Three Secondaries: Complete 76
  77. 77. That’s all about the indexing problem 77
  78. 78. Struggle to control the sync 78
  79. 79. STALE 79
  80. 80. Unknown log & out-of-control ReplSet: We often suffered from the Secondaries going out of control... • The Secondaries repeatedly flip status between Secondary and Recovering within moments (1.8.0) • Then we found a strange line in the log... [rsSync] replSet error RS102 too stale to catch up 80
  81. 81. What’s Stale? stale [stéil] (level: essential for working adults), powered by goo.ne.jp • (of food, drink, etc.) not fresh (⇔ fresh); • flat; (of coffee) having lost its aroma, • (of bread) dried out, hardened, • (of air, smells, etc.) stuffy, • foul-smelling 81
  82. 82. What’s Stale? stale [stéil] (level: essential for working adults), powered by goo.ne.jp • (of food, drink, etc.) not fresh (⇔ fresh); • flat; (of coffee) having lost its aroma, • (of bread) dried out, hardened, • (of air, smells, etc.) stuffy, • foul-smelling. Apparently it is something very bad... 82
  83. 83. Mechanism of being stale 83
  84. 84. ReplicaSet (diagram): Clients; Primary mongod (Database, Oplog); Secondary mongod (Database, Oplog) 84
  85. 85. Replication (simple case) 85
  86. 86. ReplicaSet (diagram): Clients; Primary mongod (Database, Oplog); Secondary mongod (Database, Oplog) 86
  87. 87. Insert & Replication 1: A Client inserts A. Primary: Database [A], Oplog [Insert A]. Secondary: Database [], Oplog [] 87
  88. 88. Insert & Replication 1: Sync. Primary: Database [A], Oplog [Insert A]. Secondary: Database [A], Oplog [Insert A] 88
  89. 89. Replication (busy case) 89
  90. 90. Stale: Starting state. Primary: Database [A], Oplog [Insert A]. Secondary: Database [A], Oplog [Insert A] 90
  91. 91. Insert & Replication 2: A Client inserts B. Primary: Database [B, A], Oplog [Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 91
  92. 92. Insert & Replication 2: A Client inserts C. Primary: Database [C, B, A], Oplog [Insert C, Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 92
  93. 93. Insert & Replication 2: A Client updates A. Primary: Database [C, B, A], Oplog [Update A, Insert C, Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 93
  94. 94. Insert & Replication 2: Check Oplog. Primary: Database [C, B, A], Oplog [Update A, Insert C, Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 94
  95. 95. Insert & Replication 2: Sync. Primary: Database [C, B, A], Oplog [Update A, Insert C, Insert B, Insert A]. Secondary: Database [C, B, A], Oplog [Update A, Insert C, Insert B, Insert A] 95
  96. 96. Replication (more busy) 96
  97. 97. Stale: Starting state. Primary: Database [A], Oplog [Insert A]. Secondary: Database [A], Oplog [Insert A] 97
  98. 98. Stale: A Client inserts B. Primary: Database [B, A], Oplog [Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 98
  99. 99. Stale: A Client inserts C. Primary: Database [C, B, A], Oplog [Insert C, Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 99
  100. 100. Stale: A Client updates A. Primary: Database [C, B, A], Oplog [Update A, Insert C, Insert B, Insert A]. Secondary: Database [A], Oplog [Insert A] 100
  101. 101. Stale: A Client updates C. Primary: Database [C, B, A], Oplog [Update C, Update A, Insert C, Insert B]; [Insert A] is pushed out of the Oplog. Secondary: Database [A], Oplog [Insert A] 101
  102. 102. Stale: A Client inserts D. Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C]; [Insert B] and [Insert A] are pushed out of the Oplog. Secondary: Database [A], Oplog [Insert A] 102
  103. 103. Stale: Check Oplog. [Insert A] not found !! Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C]. Secondary: Database [A], Oplog [Insert A] 103
  104. 104. Stale: Check Oplog. [Insert A] not found !! The Secondary cannot get information about [Insert B], so it cannot sync !! This is called STALE. Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C]. The Secondary becomes Recovering: Database [A], Oplog [Insert A] 104
  105. 105. Stale: We have to understand the importance of adjusting the oplog size. We can specify the oplog size as a command-line option, but only the first time for a given dbpath (which is also specified on the command line). We also cannot change the oplog size without clearing the dbpath. Be careful ! 105
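The stale scenario walked through on the preceding slides can be replayed with a tiny model of a capped oplog. This is a simplified sketch under stated assumptions: the `Oplog` class, `trySync`, and `replay` are hypothetical names, the capacity is counted in entries rather than bytes, and a secondary is treated as stale when its last applied entry can no longer be found on the primary, as the slides describe.

```javascript
// Minimal model of a capped oplog: fixed number of entries, oldest dropped first.
class Oplog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
    this.nextTs = 1;
  }
  append(op) {
    this.entries.push({ ts: this.nextTs++, op });
    if (this.entries.length > this.capacity) this.entries.shift();
  }
}

// The secondary looks for its last applied entry in the primary's oplog.
// If that entry is gone, it cannot tell what it missed: it is stale.
function trySync(primaryOplog, lastAppliedTs) {
  const idx = primaryOplog.entries.findIndex((e) => e.ts === lastAppliedTs);
  if (idx === -1) return { stale: true, ops: [] };
  const ops = primaryOplog.entries.slice(idx + 1).map((e) => e.op);
  return { stale: false, ops };
}

// Replay the slides: the secondary has applied "Insert A" only,
// then the primary takes five more writes before the next sync attempt.
function replay(capacity) {
  const oplog = new Oplog(capacity);
  oplog.append('Insert A');
  const lastAppliedTs = 1;
  for (const op of ['Insert B', 'Insert C', 'Update A', 'Update C', 'Insert D']) {
    oplog.append(op);
  }
  return trySync(oplog, lastAppliedTs);
}

console.log(replay(4)); // small oplog, as on the slides
console.log(replay(6)); // larger oplog
```

With room for four entries the secondary goes stale, while six entries are enough for it to catch up. That is the practical reason for sizing the oplog to cover the longest period a secondary may fall behind.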
  106. 106. Replication (Join as a new node) 106
  107. 107. InitialSync: Primary only. Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C] 107
  108. 108. InitialSync: A new mongod starts (Startup) with an empty Database and Oplog. Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C] 108
  109. 109. InitialSync: Get last Oplog. The new node (Recovering) records [Insert D] in its Oplog. Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C] 109
  110. 110. InitialSync: Cloning DB. The new node (Recovering) starts copying D, C, B, A from the Primary; its Oplog is [Insert D] 110
  111. 111. InitialSync: Cloning DB. The new node (Recovering) has copied A so far: Database [A], Oplog [Insert D] 111
  112. 112. InitialSync: Cloning DB, while a Client inserts E. Primary: Database [E, D, C, B, A], Oplog [Insert E, Insert D, Update C, Update A]; [Insert C] is pushed out. New node (Recovering): Database [B, A], Oplog [Insert D] 112
  113. 113. InitialSync: Cloning DB complete, while a Client updates B. Primary: Database [E, D, C, B, A], Oplog [Update B, Insert E, Insert D, Update C]; [Update A] is pushed out. New node (Recovering): Database [D, C, B, A], Oplog [Insert D] 113
  114. 114. InitialSync: Check Oplog. Primary: Database [E, D, C, B, A], Oplog [Update B, Insert E, Insert D, Update C]. New node (Recovering): Database [D, C, B, A], Oplog [Insert D] 114
  115. 115. InitialSync: Sync. Primary: Database [E, D, C, B, A], Oplog [Update B, Insert E, Insert D, Update C]. The new node becomes Secondary: Database [E, D, C, B, A], Oplog [Update B, Insert E, Insert D] 115
  116. 116. Additional information: From the source code (I’ve never tested these...). A Secondary will try to sync from other Secondaries when it cannot reach the Primary or might be stale against the Primary. There is a small chance that the sync problem does not occur if a Secondary has an older Oplog or a larger Oplog space than the Primary 116
  117. 117. Sync from another secondary: Primary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C]; [Insert B] and [Insert A] are pushed out. Lagging Secondary: Database [A], Oplog [Insert A]. Other Secondary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C, Insert B, Insert A] 117
  118. 118. Sync from another secondary: Check Oplog. [Insert A] is not found on the Primary !! 118
  119. 119. Sync from another secondary: Check Oplog. But [Insert A] is found on the other Secondary, so the lagging Secondary is able to sync 119
  120. 120. Sync from another secondary: Sync. The lagging Secondary catches up from the other Secondary: Database [D, C, B, A], Oplog [Insert D, Update C, Update A, Insert C, Insert B, Insert A] 120
  121. 121. That’s all about sync 121
  122. 122. Others... 122
  123. 123. Disk space 123
  124. 124. Disk space: Data fragments sparsely across the DB files... We ran into an unfavorable situation in our DBs. It appeared in some of our collections around 3 months after we launched the services: db.ourcol.storageSize() = 16200727264 (15 GB); db.ourcol.totalSize() = 16200809184; db.ourcol.totalIndexSize() = 81920; db.ourcol.dataSize() = 2032300 (2 MB). What happened to them !! 124
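The four numbers on this slide are enough to quantify the bloat. The sketch below only does arithmetic on the figures quoted above; the `bloatReport` helper and its field names are made up for illustration.

```javascript
// Figures quoted on the slide (bytes).
const stats = {
  storageSize: 16200727264,
  totalSize: 16200809184,
  totalIndexSize: 81920,
  dataSize: 2032300,
};

function bloatReport(s) {
  return {
    // totalSize should be storage plus indexes
    totalIsConsistent: s.storageSize + s.totalIndexSize === s.totalSize,
    // how many bytes of allocated storage exist per byte of live data
    storageToDataRatio: s.storageSize / s.dataSize,
    // allocated space not holding live data
    unusedBytes: s.storageSize - s.dataSize,
  };
}

const report = bloatReport(stats);
console.log(report);
```

The allocated storage is roughly 8,000 times the live data, so almost all of the 15 GB is empty space left behind inside the data files.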
  125. 125. Disk space: Data fragments sparsely across the DB files... It seems to be caused by a specific operation pattern: insert, update and delete over and over. Anyway, we have to shrink the disk space in use regularly, just like PostgreSQL’s vacuum. But how to do it ? 125
  126. 126. Disk space: Shrinking the disk space in use. MongoDB offers some functions for this case, but we couldn’t use them in our case ! repairDatabase: only runnable on the Primary. It takes a long time and BLOCKS all operations !! compact: only runnable on a Secondary. It zero-fills the blank space instead of shrinking the disk space, so it cannot shrink... 126
  127. 127. Disk space: Our measures. For temporary collections: issue the drop command regularly. For other collections: 1. Remove one secondary from the ReplSet. 2. Shut it down. 3. Remove all its DB files. 4. Rejoin it to the ReplSet. 5. Repeat these operations for each secondary, one after another. 6. Step down the Primary (change the Primary node). 7. Finally, do operations 1–4 on the former Primary. 127
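The seven-step procedure above amounts to a rolling resync: every secondary in turn, then the former primary after a step-down. A minimal sketch of that ordering follows; it only generates a checklist and runs no MongoDB command. The `shrinkPlan` function and the host names are hypothetical, and arbiters are left out because they hold no data.

```javascript
// Produce the ordered checklist for shrinking every data-bearing member.
function shrinkPlan(members) {
  const primary = members.find((m) => m.role === 'primary');
  if (!primary) throw new Error('no primary in member list');
  const secondaries = members.filter((m) => m.role === 'secondary');

  // Steps 1-4 of the slide, applied to a single host.
  const resync = (host) => [
    `remove ${host} from the ReplSet`,
    `shut down mongod on ${host}`,
    `delete all DB files on ${host}`,
    `rejoin ${host} to the ReplSet`,
  ];

  const plan = [];
  for (const s of secondaries) plan.push(...resync(s.host)); // step 5
  plan.push(`step down the primary ${primary.host}`);        // step 6
  plan.push(...resync(primary.host));                        // step 7
  return plan;
}

// Made-up three-member set.
const plan = shrinkPlan([
  { host: 'node1', role: 'primary' },
  { host: 'node2', role: 'secondary' },
  { host: 'node3', role: 'secondary' },
]);
plan.forEach((step, i) => console.log(`${i + 1}. ${step}`));
```

In practice each resync should finish, with the node back in the Secondary state, before the next one starts; otherwise too few members hold a full copy of the data.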
  128. 128. Disk space: Shrink operation. Primary (Bloated), Secondary (Bloated), Secondary (Bloated) 128
  129. 129. Disk space: Shrink operation. Shut down mongod (kill -15) on one secondary. Primary (Bloated), Dead (Bloated), Secondary (Bloated) 129
  130. 130. Disk space: Shrink operation. Delete DBPATH. Primary (Bloated), Dead (Nothing), Secondary (Bloated) 130
  131. 131. Disk space: Shrink operation. Start mongod. Primary (Bloated), Startup (Nothing), Secondary (Bloated) 131
  132. 132. Disk space: Shrink operation. Primary (Bloated), Secondary (Shrunk), Secondary (Bloated) 132
  133. 133. Disk space: Shrink operation. Shut down mongod, delete DBPATH, start mongod on the other secondary. Primary (Bloated), Secondary (Shrunk), Secondary (Shrunk) 133
  134. 134. Disk space: Shrink operation. Step down. Secondary (Bloated), Primary (Shrunk), Secondary (Shrunk) 134
  135. 135. Disk space: Shrink operation. Shut down mongod, delete DBPATH, start mongod on the former primary. Secondary (Shrunk), Primary (Shrunk), Secondary (Shrunk) 135
  136. 136. PHP client 136
  137. 137. PHP client: We tried 1.1.4 and 1.2.2. 1.1.4: There are some critical bugs around the connection pool. We struggled to invalidate broken connections. I think you should use 1.2.X instead of 1.1.X. 1.2.2: The connection pool problems seem to be fixed. But there are 2 critical bugs ! – Socket handle leak – Useless sleep. However, this version is relatively stable as long as these bugs are fixed 137
  138. 138. PHP client: Patches: https://github.com/crumbjp/Personal - mongo1.2.2.non-wait.patch - mongo1.2.2.sock-leak.patch 138
  139. 139. PHP client 139
  140. 140. Closing 140
  141. 141. Closing: What’s MongoDB ? It has very good READ performance: we can use Mongo instead of memcached if we can accept the limited write performance. Die hard ! MongoDB has high availability even under severe stress. It can be used easily without deep consideration: we can manage to do anything after getting started. Let’s forget the awkward, trivial things that have bothered us: How to treat huge data ? How to put in a cache system ? How to keep availability ? And so on .... 141
  142. 142. Closing: Keep in mind. Sharding is challenging... It’s a last resort ! It’s hard to operate, in particular maintaining the config servers. [Mongos] is also difficult to keep alive; I want a way to fail over Mongos. Mongo is able to run in a poor environment, but... the ONLY thing you should set aside is a large disk space. Huge writes are sensitive: adjust the oplog size carefully. The indexing function is unfinished: indexes cannot be applied online 142
  143. 143. All right, Have fun !! 143
  144. 144. All right, Have fun !! ...with us at Rakuten 144
  145. 145. All right, Have fun !! ...with us at Rakuten. Would you like to join Rakuten for cool work? 145
  146. 146. Thank you for listening 146
