Crossing the Production Barrier: Development at Scale
Published in: Technology

  1. Crossing the Production Barrier: Development at Scale. jgoulah@etsy.com / @johngoulah
  2. The world’s handmade marketplace: a platform for people to sell homemade goods, crafts, and vintage goods.
  3. 42MM unique visitors/mo.
  4. 1.5B+ page views/mo.; 42MM unique visitors/mo.
  5. 1.5B+ page views/mo.; 42MM unique visitors/mo.; 850K shops / 200 countries
  6. 1.5B+ page views/mo.; 895MM sales in 2012; 42MM unique visitors/mo.; 850K shops / 200 countries
  7. Big cluster: 20 shards, and adding 5 more.
  8. Notes: over 40% increase in QPS from last year (25K last year); an additional 30K is moving over from Postgres; 1/3 of RAM is not dedicated to the pool (OS, disk, network buffers, etc.).
  9. 4 TB InnoDB buffer pool
  10. 4 TB InnoDB buffer pool; 20 TB+ data stored
  11. 60K+ queries/sec avg; 4 TB InnoDB buffer pool; 20 TB+ data stored
  12. 60K+ queries/sec avg; 4 TB InnoDB buffer pool; 20 TB+ data stored; ~1.2 Gbps outbound (plain text)
  13. 60K+ queries/sec avg; 4 TB InnoDB buffer pool; 20 TB+ data stored; 99.99% of queries under 1 ms; ~1.2 Gbps outbound (plain text)
  14. 50+ MySQL servers / 800 CPUs. Server spec: HP DL380 G7, 96 GB RAM, 16 spindles / 1 TB RAID 10, 24 cores. Notes: 16 x 146 GB.
  15. The Problem: been around since ’05; hit this a few years ago; every big company probably has this issue.
  16. DATA: sync prod to dev, until prod data gets too big. (image: http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/)
  17. Some Approaches. Notes: subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc.); generated data can be time-consuming to fake.
  18. Some Approaches: subsets of data
  19. Some Approaches: subsets of data; generated data
  20. But... there is a problem with both of those approaches.
  21. Edge Cases: what about testing edge cases and difficult-to-diagnose bugs? It is hard to model the same data set that produced a user-facing bug. (image: http://www.flickr.com/photos/sovietuk/141381675/sizes/l/in/photostream/)
  22. Perspective: another issue is testing problems at scale, with complex and large gobs of data. A real social network ecosystem can be difficult to generate (favorites, follows) (activity feed; “similar items” search gives better results). (image: http://www.flickr.com/photos/donsolo/2136923757/sizes/l/in/photostream/)
  23. Prod → Dev? What most people do before data gets too big. Almost 2 days to sync 20 TB over a 1 Gbps link, 5 hrs over 10 Gbps. Bringing the prod dataset to dev was expensive (hardware/maintenance); keeping parity with prod and applying schema changes would take at least as long.
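
The sync times quoted on slide 23 can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s) and an ideal link with no protocol overhead or contention; the function name is illustrative only.

```python
def transfer_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time in hours, ignoring overhead and contention."""
    return size_bytes * 8 / link_bits_per_sec / 3600


TB = 10**12    # decimal terabyte (assumption)
GBPS = 10**9   # bits per second

hours_1g = transfer_hours(20 * TB, 1 * GBPS)
hours_10g = transfer_hours(20 * TB, 10 * GBPS)

print(f"1 Gbps:  {hours_1g:.1f} h ({hours_1g / 24:.2f} days)")
print(f"10 Gbps: {hours_10g:.1f} h")
```

Under these assumptions the 1 Gbps figure comes out a little under two days and the 10 Gbps figure between four and five hours, in line with the slide. Real transfers would take longer.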
  24. Use Production: so we did what we saw as the last resort and used production. Not for greenfield development; more for mature features and diagnosing bugs. We still have a dev database, but the data is sparse and unreliable.
  25. Use Production (sometimes)
  26. It goes without saying that this can be dangerous. It is also difficult to do right; we’ve been working on this for a year. (image: http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/)
  27. Approach: two big things, cultural and technical.
  28. Solve Culture Issues First: part of figuring this out was exhausting all other options and getting buy-in from major stakeholders.
  29. Two “Simple” Technical Issues
  30. Step 0: failure recovery
  31. Step 1: make it safe. How to have test data in production and prevent stupid mistakes.
  32. Phased rollout
  33. Phased rollout: read-only
  34. Phased rollout: read-only; r/w dev shard only
  35. Phased rollout: read-only; r/w dev shard only; full r/w
  36. How? How did we do it?
  37. Quick Overview: high-level view. (image: http://www.flickr.com/photos/h-k-d/7852444560/sizes/o/in/photostream/)
  38. Diagram: tickets, index, shard1, shard2, shardN
  39. Diagram: tickets, index, shard1, shard2, shardN. Label: Unique IDs
  40. Diagram: tickets, index, shard1, shard2, shardN. Label: Shard Lookup
  41. Diagram: tickets, index, shard1, shard2, shardN. Label: Store/Retrieve Data
  42. Dev shard: introducing the dev shard, the shard used for initial writes of data created when coming from the dev environment.
  43. Diagram: tickets, index, shard1, shard2, shardN
  44. Diagram: tickets, index, shard1, shard2, shardN, DEV shard
  45. Diagram: shard1, shard2, shardN, DEV shard; www.etsy.com and www.goulah.vm. Label: Initial Writes
  46. Diagram: shard1, shard2, shardN, DEV shard; www.etsy.com and www.goulah.vm. Label: Initial Writes
  47. Diagram: shard1, shard2, shardN, DEV shard; www.etsy.com and www.goulah.vm. Label: Initial Writes
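
The flow in slides 38 to 47 (get a unique ID, look up the shard, store or retrieve the data, and send initial writes from a dev environment to the dev shard) can be sketched roughly as follows. This is an in-memory illustration, not Etsy's implementation: the class, the method names, and the modulo placement rule for production shards are all assumptions.

```python
import itertools


class ShardedStore:
    """Toy model of tickets + index + shards, with a separate dev shard."""

    def __init__(self, shard_names, dev_shard="dev_shard"):
        self._tickets = itertools.count(1)   # stand-in for the tickets servers
        self._index = {}                     # object id -> shard name
        self._prod_shards = list(shard_names)
        self._dev_shard = dev_shard
        self._shards = {name: {} for name in [*self._prod_shards, dev_shard]}

    def create(self, data, from_dev):
        object_id = next(self._tickets)      # unique ID
        if from_dev:
            shard = self._dev_shard          # initial writes from dev land here
        else:
            # Placement rule is hypothetical; the slides do not describe it.
            shard = self._prod_shards[object_id % len(self._prod_shards)]
        self._index[object_id] = shard       # record where the object lives
        self._shards[shard][object_id] = data
        return object_id

    def shard_of(self, object_id):
        return self._index[object_id]        # shard lookup

    def read(self, object_id):
        return self._shards[self.shard_of(object_id)][object_id]


store = ShardedStore(["shard1", "shard2"])
prod_id = store.create({"shop": "real"}, from_dev=False)
dev_id = store.create({"shop": "test"}, from_dev=True)
print(store.shard_of(prod_id), store.shard_of(dev_id))
```

Because reads always go through the index, an object created from dev is found on the dev shard without the caller needing to know where it was written.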
  48. MySQL Proxy
  49. The proxy hits all of the shards/index/tickets. http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html
  50. Dangerous/unnecessary queries. Notes: filter dangerous queries (queries without a WHERE); remove unnecessary queries (instead of DELETE, have a flag; ALTER statements don’t run from dev).
  51. Dangerous/unnecessary queries:
        (DEV) etsy_rw@jgoulah [test]> select * from fred_test;
  52. Dangerous/unnecessary queries:
        (DEV) etsy_rw@jgoulah [test]> select * from fred_test;
        ERROR 9001 (E9001): Selects from tables must have where clauses
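
A filter of the kind described on slides 50 to 52 might look like the sketch below. It is written in Python for illustration only (MySQL Proxy itself is scripted in Lua) and is not the deck's actual filter. The 9001 error text is taken from slide 52; the 9002 code and message, the function name, and the naive keyword matching are assumptions.

```python
import re

# Error text from slide 52.
ERR_NO_WHERE = (9001, "Selects from tables must have where clauses")
# Hypothetical code and message for statements that never run from dev.
ERR_BLOCKED = (9002, "ALTER statements do not run from dev")

_ALTER = re.compile(r"^\s*ALTER\b", re.IGNORECASE)
_SELECT_FROM = re.compile(r"^\s*SELECT\b.*\bFROM\b", re.IGNORECASE | re.DOTALL)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def check_query(sql):
    """Return None if the query may pass, else an (error code, message) pair."""
    if _ALTER.match(sql):
        return ERR_BLOCKED
    # Naive check: a WHERE inside a string literal would also satisfy it.
    if _SELECT_FROM.match(sql) and not _WHERE.search(sql):
        return ERR_NO_WHERE
    return None


print(check_query("select * from fred_test;"))
print(check_query("SELECT id FROM fred_test WHERE id = 1"))
```

A production filter would need a real SQL tokenizer rather than regular expressions; the point here is only the shape of the decision.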
  53. Known ingress/egress funnel: we know where all of the queries from dev originate. (image: http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/)
  54. Explicitly enabled; not on all the time.
        % dev_proxy on
        Dev-Proxy config is now ON. Use dev_proxy off to turn it off.
  55. Visual notifications
  56. Notify engineers that they are using the proxy; this is read-only mode.
  57. Read/write mode
  58. Read-write mode is needed for login and other things that write data.
  59. Stealth data: hiding data from users (favorites go on dev and prod shard; making sure test users/shops don’t show up). (image: http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/)
  60. Security (image: http://www.flickr.com/photos/sidelong/3878741556/sizes/l/in/photostream/)
  61. PCI: token exchange only; locked down for most people.
  62. PCI: off-limits
  63. Anomaly detection: another part of our security setup is detection.
  64. Logging: the basis of anomaly detection is log collection.
  65. Log line:
        2013-04-22 18:05:43 485370821 devproxy -- /* DEVPROXY source=10.101.194.19:40198 uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 [htSp8458VmHlC] [etsy_index_B] [browse.php] */ SELECT id FROM table;
  66. Same log line. Labels: date
  67. Same log line. Labels: date; thread id
  68. Same log line. Labels: date; thread id; source ip
  69. Same log line. Labels: date; thread id; source ip; unique id generated by proxy
  70. Same log line. Labels: date; thread id; source ip; unique id generated by proxy; app request id
  71. Same log line. Labels: date; thread id; source ip; unique id generated by proxy; app request id; dest. shard
  72. Same log line. Labels: date; thread id; source ip; unique id generated by proxy; app request id; dest. shard; script
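
For log collection, the fields labeled on slides 66 to 72 can be pulled out of a proxy log line with a regular expression. The sketch below assumes single spaces between the parts of the line (the slide text was extracted without reliable spacing); the group names and the function name are illustrative.

```python
import re

LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<thread_id>\d+) devproxy -- "
    r"/\* DEVPROXY source=(?P<source_ip>[\d.]+):(?P<source_port>\d+) "
    r"uuid=(?P<uuid>[0-9a-f-]+) "
    r"\[(?P<request_id>[^\]]+)\] \[(?P<shard>[^\]]+)\] \[(?P<script>[^\]]+)\] \*/ "
    r"(?P<query>.*)"
)


def parse_devproxy_line(line):
    """Return a dict of the labeled fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    "2013-04-22 18:05:43 485370821 devproxy -- "
    "/* DEVPROXY source=10.101.194.19:40198 "
    "uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 "
    "[htSp8458VmHlC] [etsy_index_B] [browse.php] */ "
    "SELECT id FROM table;"
)
fields = parse_devproxy_line(sample)
print(fields["source_ip"], fields["shard"], fields["script"])
```

With the fields separated, anomaly checks such as "queries from an unexpected source IP" or "writes to a shard other than the dev shard" become simple filters over the parsed records.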
  73. Login-as (read only, logged with reason for access)
  74. Reason is recorded and reviewed
  75. Recovery
  76. Sources of restore data
  77. Sources of restore data: Hadoop
  78. Sources of restore data: Hadoop; Backups
  79. Sources of restore data: Hadoop; Backups; Delayed Slaves
  80. Delayed Slaves: pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it at least as far behind the master as you request. (image: http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/)
  81. Delayed Slaves. Notes: role of the delayed slave; also a source for BCP (business continuity planning: prevention of and recovery from threats).
  82. Delayed Slaves: 4 hour delay behind master
  83. Delayed Slaves: 4 hour delay behind master; produce row-based binary logs
  84. Delayed Slaves: 4 hour delay behind master; produce row-based binary logs; allow for quick recovery
  85. pt-slave-delay --daemonize --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log --delay 4h --interval 1m --nocontinue
      Notes: the last 3 options are the most important. --delay is the 4h delay; --interval is how frequently it should check whether the slave should be started or stopped; --nocontinue means don’t continue replication normally on exit. User/pass eliminated for brevity.
  86. Diagram: R/W, R/W, Slave. Label: Shard Pair
  87. Diagram: R/W, R/W, Slave. Labels: Shard Pair; pt-slave-delay
  88. Diagram: R/W, R/W, Slave. Labels: Shard Pair; pt-slave-delay; row-based binlogs
  89. Diagram: R/W, R/W, Slave. Labels: Shard Pair; HDFS; Vertica; Parse/Transform. Notes: in addition, slaves can be used to send data to other stores for offline queries: 1) parse each binlog file to generate a sequence file of row changes; 2) apply the row changes to a previous set to get the latest version.
  90. Something bad happens... a bad query is run (bad update, etc.). (image: http://www.flickr.com/photos/focalintent/1332072795/sizes/o/in/photostream/)
  91. Before Restoration (diagram: A, B, Slave). Notes: master.info should be pointing to the right place; step 2 could be flipping the physical box (for faster recovery, such as index servers).
  92. Before Restoration: 1) stop delayed slave replication
  93. Before Restoration: 1) stop delayed slave replication; 2) pull side A
  94. Before Restoration: 1) stop delayed slave replication; 2) pull side A; 3) stop master-master replication
  95. On delayed slave, get the relay position:
        > SHOW SLAVE STATUS
        Relay_Log_File: dbslave-relay.007178
        Relay_Log_Pos: 8666654
  96. On delayed slave, show relaylog events will show statements from the relay log; pass the relay log and the position to start from:
        mysql> show relaylog events in "dbslave-relay.007178" from 8666654 limit 1\G
        *************************** 1. row *******************
        Log_name: dbslave-relay.007178
        Pos: 8666654
        Event_type: Query
        Server_id: 1016572
        End_log_pos: 8666565
        Info: use `etsy_shard`; /* [CVmkWxhD7gsatX8hLbkDoHk29iKo] [etsy_shard_001_B] [/your/activity/index.php] */ UPDATE `news_feed_stats` SET `time_last_viewed` = 1366406780, `update_time` = 1366406780 WHERE `owner_id` = 30793071 AND `owner_type_id` = 2 AND `feed_type` = owner
        2 rows in set (0.00 sec)
  97. Filter bad queries: cycle through all the logs and analyze Query events; Rotate events give the next log file; the last relay log points to the master binlog (server_id is the master’s, and the binlog coordinates match master_log_file/pos). (image: http://www.flickr.com/photos/chriswaits/6607823843/sizes/l/in/photostream/)
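
The log-walking step on slide 97 could be sketched as below. This is a simplified illustration, not the tool used in the deck. It assumes the rows of each relay log have already been fetched into dicts keyed like the `show relaylog events` output, and that a Rotate event's Info field has the form `next-file;pos=N`; the function name, the sample events, and the "bad query" predicate are hypothetical.

```python
def replayable_queries(logs, start_log, start_pos, is_bad):
    """Yield the Info of Query events from the start position onward,
    following Rotate events to the next log and skipping bad queries."""
    log_name, pos = start_log, start_pos
    while log_name in logs:
        next_log = None
        for event in logs[log_name]:
            if event["Pos"] < pos:
                continue
            if event["Event_type"] == "Rotate":
                # Assumed Info format: "<next log file>;pos=<offset>"
                next_log, _, pos_part = event["Info"].partition(";pos=")
                pos = int(pos_part)
                break
            if event["Event_type"] == "Query" and not is_bad(event["Info"]):
                yield event["Info"]
        if next_log is None:
            return
        log_name = next_log


# Hypothetical events for illustration.
logs = {
    "dbslave-relay.007178": [
        {"Pos": 100, "Event_type": "Query", "Info": "UPDATE t SET a = 1 WHERE id = 1"},
        {"Pos": 200, "Event_type": "Query", "Info": "UPDATE t SET a = 0"},
        {"Pos": 300, "Event_type": "Rotate", "Info": "dbslave-relay.007179;pos=4"},
    ],
    "dbslave-relay.007179": [
        {"Pos": 4, "Event_type": "Format_desc", "Info": "..."},
        {"Pos": 120, "Event_type": "Query", "Info": "INSERT INTO t VALUES (2, 1)"},
    ],
}


def unbounded_update(query):
    return query.startswith("UPDATE") and "WHERE" not in query


kept = list(replayable_queries(logs, "dbslave-relay.007178", 100, unbounded_update))
print(kept)
```

The walk stops when a Rotate event names a file that is not among the collected relay logs, which in this sketch stands in for the final relay log pointing at the master binlog.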
  98. After Delayed Slave Data Is Restored (diagram: A, B, Slave). Notes: master.info should be pointing to the right place; step 2 could be flipping the physical box (for faster recovery, such as index servers).
  99. After Delayed Slave Data Is Restored: 1) stop mysql on A and slave
  100. After Delayed Slave Data Is Restored: 1) stop mysql on A and slave; 2) copy data files to A
  101. After Delayed Slave Data Is Restored: 1) stop mysql on A and slave; 2) copy data files to A; 3) restart B to A replication, let A catch up to B
  102. After Delayed Slave Data Is Restored: 1) stop mysql on A and slave; 2) copy data files to A; 3) restart B to A replication, let A catch up to B; 4) restart A to B replication, put A back in, then pull B
  103. Other Forms of Recovery. Migrate Single Object (user/shop/etc.): migrate the object from the delayed slave (similar to shard migration). Hadoop Deltas: can generate deltas from Hadoop. Backup + Binlogs: if the delayed slave has “played” the bad data, go from last night’s backup (slower).
  104. Use Cases: what are some use cases? (image: http://www.flickr.com/photos/seatbelt67/502255276/sizes/o/in/photostream/)
  105. User reports a bug... a user files a bug, and I can trace the code for the exact page they’re on, right from my dev machine.
  106. Testing “dry” writes: testing how the application runs a “dry” write. In r/o mode, an exception is thrown with the exact query it would have attempted to run, the values it tried to use, etc.
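
The “dry” write behavior on slide 106 can be illustrated with a thin wrapper that refuses writes and reports what it would have run. This is a sketch under assumptions: the class names, the list of write verbs, and the callable used for reads are all hypothetical, and the deck does not say where in the stack the exception is raised.

```python
class DryWriteError(Exception):
    """Raised in read-only mode; carries the query and values that were attempted."""

    def __init__(self, query, params):
        super().__init__(f"read-only mode: would have run {query!r} with {params!r}")
        self.query = query
        self.params = params


class ReadOnlyConnection:
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, execute_read):
        # execute_read is any callable (query, params) -> rows.
        self._execute_read = execute_read

    def execute(self, query, params=()):
        if query.lstrip().upper().startswith(self.WRITE_VERBS):
            raise DryWriteError(query, params)
        return self._execute_read(query, params)


conn = ReadOnlyConnection(lambda query, params: [("row",)])
try:
    conn.execute("UPDATE shops SET name = %s WHERE shop_id = %s", ("test", 42))
except DryWriteError as err:
    attempted = (err.query, err.params)
    print(err)
```

Reads pass through unchanged, so a page can be exercised end to end against real data while every write it would have made is surfaced as an exception.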
  107. Search ads campaign consistency: starting campaigns and maintaining consistency for the entire ad system is nearly impossible in dev. Search ads data is stored in more than a dozen DB tables, and state changes are driven by a combination of browsers triggering ads, sellers managing their campaigns, and a slew of crons running anywhere from once every 5 minutes to once a month. E.g., to test pausing campaigns that run out of money mid-day, we can pull large numbers of campaigns from prod and operate on those to verify that the data will still be consistent.
  108. Google product listing ads: GPLA is where we syndicate our listings to Google to be used in Google product search ads. We can test edge cases in GPLA syndication where it would be difficult to recreate the state in dev.
  109. Testing prototypes: features like similar items search give better results in production because of the amount of data; this allowed us to test the quality of the listings a prototype was displaying.
  110. Performance testing: we need a real data set to test pages like treasury search with lots of threads/avatars/etc. The dev data is too sparse, xhprof traces don’t mean anything, and missing avatars change the perf characteristics.
  111. Hadoop-generated datasets: a dataset produced from Hadoop (recommendations for users, or statistics about usage). Since Hadoop holds prod data, the dataset is for prod users/listings/shops, so we have to check it against prod. A sync to dev would fill the dev DBs, and the data wouldn’t line up (because it is prod data).
  112. Browse slices: browse slices have complex population, so it’s easier to test an experiment against prod data.
  113. Not enough listings to populate the narrower subcategories, and it just takes too long.
  114. Thank You. etsy.com/jobs. We’re hiring.
