The Shard Revisited
Tools and Techniques Used at Etsy

jgoulah@etsy.com / @johngoulah

Tuesday, November 12, 13

A marketplace for people around the world to connect, buy, and sell
unique goods
Etsy is the mar...
60MM+ unique visitors/mo.
1.5B+ page views / mo.
1M+ shops / 200 countries
895MM sales in 2012

this talk consists of the architecture, our dev data problem/solution, and
other tools
big clust...
6TB InnoDB buffer pool
30TB+ data stored
100K+ queries/sec avg
~1.8Gbps outbound (plain text)
99.9% queries under 1ms
Tuesd...
~100 MySQL servers
1100 15K rpm disks / 1600+ CPUs
Server Spec
HP DL 380 G8
96GB RAM
16 spindles / 2TB RAID 10
24 Core
Tu...
Architecture

2 key concerns when you reach scale....
Redundancy

the duplication of critical components of a system with the intention of
increasing r...
Master - Master
R/W


duplication of critical components....

R/W
Master - Master
R/W

R/W

Side A

Side B


we call these sides “replicants”
Scalability

the ability of a system to handle a growing amount of work in a capable
manner
(grocer...
shard 1

shard 2

shard N

...


horizontal scaling
shard 1

shard 2

shard N

...

shard N + 1


horizontal scaling
shard 1

shard 2

shard N

...
Migrate

Migrate
shard N + 1


horizontal scaling

Migrate
Bird’s-Eye View


http://www.flickr.com/photos/feuilllu/36612719/sizes/l/in/
photostream/
tickets

shard 1

index

shard 2


3 main components
couple others, dbaux, dbtasks

shard N
tickets

index

Unique IDs
shard 1


shard 2

shard N
tickets

index

Shard Lookup
shard 1


shard 2

shard N
tickets

shard 1

index

shard 2

Store/Retrieve Data

shard N
Basics

what is sharding?
users_groups
user_id

group_id

1

A

1

B

2

A

2

C

3

A

3

B

3

C

users_groups
user_id

group_id

1

A

1

B

2

A

2

C

3

A

3

B

3

C


creating horizontal pa...
users_groups
user_id

group_id

1

A

1

B

2

A

user_id

group_id

2

C

3

A

3

A

3

B

3

B

3

C

3

C

Tuesday, No...
users_groups
shard 1
user_id

group_id

1

A

1

B

2

A

user_id

group_id

2

C

3

A

3

B

3

C

Tuesday, November 12,...
Index Servers

have to be able to find the data, these simply exist to look up where the
data is
t...
index

shard 1

shard 2


want to find details for a user

shard N
index

shard 1

select shard_id from user_index
where user_id = X

shard 2


first get the shard i...
index

select shard_id from user_index
where user_id = X

returns 1
shard 1


shard 2

shard N
index

shard 1


select join_date from users
where user_id = X

shard 2

shard N
index

select join_date from users
where user_id = X

returns 2012-02-05
shard 1


shard 2

shard...
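
The two-step lookup on the slides above (ask the index which shard holds the user, then query that shard) can be sketched in Python. This is only an illustration: in-memory SQLite databases stand in for the MySQL index and shard servers, and the `find_join_date` helper, the `shards` dictionary, and the sample row are assumptions. The two SELECT statements follow the slides.

```python
import sqlite3

# In-memory SQLite databases stand in for the MySQL index and shard servers.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE user_index (user_id INTEGER PRIMARY KEY, shard_id INTEGER)"
)
index_db.execute("INSERT INTO user_index VALUES (42, 1)")  # hypothetical row

shards = {1: sqlite3.connect(":memory:"), 2: sqlite3.connect(":memory:")}
for db in shards.values():
    db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, join_date TEXT)")
shards[1].execute("INSERT INTO users VALUES (42, '2012-02-05')")


def find_join_date(user_id):
    """Two-step lookup: index first, then the shard the index points at."""
    row = index_db.execute(
        "SELECT shard_id FROM user_index WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None  # the index does not know this user
    shard = shards[row[0]]
    row = shard.execute(
        "SELECT join_date FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Every read costs one extra round trip to the index, which is the price of being able to move a user between shards by updating a single index row.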
Ticket Servers

http://www.flickr.com/photos/rexroof/5126088323/sizes/l/in/
photostream/
Globally Unique ID

can’t use auto-increment with distributed system, hand out globally
unique id...
CREATE TABLE `tickets` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`stub` char(1) NOT NULL default '',
PRIMARY KEY...
Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();


since valu...
Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
SELECT * FROM tickets;
id
4589294

Tu...
tickets A
auto-increment-increment = 2
auto-increment-offset = 1

tickets B
auto-increment-increment = 2
auto-increment-offs...
tickets A
auto-increment-increment = 2
auto-increment-offset = 1

tickets B
auto-increment-increment = 2
auto-increment-offs...
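
To see why two ticket servers with an increment of 2 and offsets of 1 and 2 never hand out the same ID, here is a small Python simulation. It does not talk to MySQL: the `TicketServer` class is a hypothetical stand-in whose counter mimics what `auto-increment-increment` and `auto-increment-offset` do to the `tickets` table.

```python
import itertools


class TicketServer:
    """Simulates the ID sequence of one ticket server."""

    def __init__(self, offset, increment):
        # Mirrors auto-increment-offset and auto-increment-increment.
        self._ids = itertools.count(offset, increment)

    def next_id(self):
        # Stands in for:
        #   REPLACE INTO tickets (stub) VALUES ('a');
        #   SELECT LAST_INSERT_ID();
        return next(self._ids)


tickets_a = TicketServer(offset=1, increment=2)  # hands out odd IDs
tickets_b = TicketServer(offset=2, increment=2)  # hands out even IDs

ids_a = [tickets_a.next_id() for _ in range(3)]
ids_b = [tickets_b.next_id() for _ in range(3)]
```

Because the two sequences are disjoint by construction, the servers need no replication between them, and a client can ask whichever one is reachable.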
Shards

shards hold the majority of the data
http://www.flickr.com/photos/merrickb/63999750/sizes/...
Object Hashing
....aka pinning data to one side of the shard

Tuesday, November 12, 13

after we determine the shard we ha...
A

user_id : 500
Tuesday, November 12, 13

so we know the shard, now which replicant
object id in this case is user_id
sid...
A

B

user_id : 500 % (# active replicants)
Tuesday, November 12, 13
A

B

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'm...
A

user_id : 500 % (2)
Tuesday, November 12, 13

B
A

user_id : 500 % (2) == 0
Tuesday, November 12, 13

B
A

user_id : 500 % (2) == 0
Tuesday, November 12, 13

B

select ...
insert ...
update ...
A

B

user_id : 500 % (2) == 0
user_id : 501 % (2) == 1
Tuesday, November 12, 13
500

select ...
insert ...
update ...

A

B

501

select ...
insert ...
update ...

user_id : 500 % (2) == 0
user_id : 501...
Failure
Tuesday, November 12, 13

http://www.flickr.com/photos/44124348109@N01/6467405231/
A

B

user_id : 500 % (2) == 0
user_id : 501 % (2) == 1
Tuesday, November 12, 13
A

B

user_id : 500 % (2) == 0
user_id : 501 % (2) == 1
Tuesday, November 12, 13
A

B

user_id : 500 % (2) == 0
user_id : 501 % (2) == 1
Tuesday, November 12, 13
A

B

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'm...
A

B

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'm...
A

B

user_id : 500 % (1) == 0
user_id : 501 % (1) == 0
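
The side selection shown on these slides is the object ID modulo the number of active replicants. A minimal Python sketch, assuming the active sides are kept in an ordered list (the function name and the list representation are hypothetical):

```python
def pick_replicant(object_id, active_replicants):
    """Pin an object to one side: object_id % (# active replicants)."""
    if not active_replicants:
        raise ValueError("no active replicants for this shard")
    return active_replicants[object_id % len(active_replicants)]


# Both sides healthy: 500 % 2 == 0 and 501 % 2 == 1.
healthy = [pick_replicant(uid, ["A", "B"]) for uid in (500, 501)]

# One side pulled from the config: everything is % 1 == 0.
degraded = [pick_replicant(uid, ["A"]) for uid in (500, 501)]
```

Removing a side from the list is the whole failover step in this sketch: every object then resolves to the remaining side without any per-object state.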
Variants

variants are mirrors of the same data in different tables
http://www.flickr.com/photos/g...
shard 2

shard 1

user_id

group_id

user_id

group_id

1

A

3

A

1

B

3

B

2

A

4

A

2

C

5

C

SELECT user_id FRO...
shard 2

shard 1

user_id

group_id

user_id

group_id

1

A

3

A

1

B

3

B

2

A

4

A

2

C

5

C

SELECT user_id FRO...
shard 2

shard 1

user_id

group_id

3

A

3

B

A

4

A

C

5

C

user_id

group_id

1

A

1

B

2
2

JOIN

SELECT user_i...
users_groups

groups_users

user_id

group_id

group_id

user_id

1

A

A

1

1

B

A

3

2

A

A

2

2

C

B

3

3

A

B
...
users_groups_index

groups_users_index

shard_id

group_id

shard_id

1

1

A

1

2

1

B

2

3

2

C

2

4

index

user_i...
users_groups_index

groups_users_index

shard_id

group_id

shard_id

1

1

A

1

2

1

B

2

3

2

C

2

4

index

user_i...
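
The variant idea (store each membership twice, once keyed by user and once keyed by group, each with its own index) can be sketched as below. Plain Python dictionaries stand in for the index and shard tables, and the helper names are hypothetical. The sample index values loosely follow the slide.

```python
from collections import defaultdict

# Index tables: which shard holds each user's rows and each group's rows.
users_groups_index = {1: 1, 2: 1, 3: 2}        # user_id  -> shard_id
groups_users_index = {"A": 1, "B": 2, "C": 2}  # group_id -> shard_id

# Every shard can hold rows of both variant tables.
shards = defaultdict(lambda: {"users_groups": set(), "groups_users": set()})


def add_membership(user_id, group_id):
    """Write the same fact twice, once per variant, each on its own shard."""
    shards[users_groups_index[user_id]]["users_groups"].add((user_id, group_id))
    shards[groups_users_index[group_id]]["groups_users"].add((group_id, user_id))


def users_in_group(group_id):
    """Single-shard read: all rows for one group live on one shard."""
    shard = shards[groups_users_index[group_id]]
    return sorted(u for g, u in shard["groups_users"] if g == group_id)


def groups_of_user(user_id):
    """Single-shard read in the other direction."""
    shard = shards[users_groups_index[user_id]]
    return sorted(g for u, g in shard["users_groups"] if u == user_id)
```

The cost is a second write per membership; the benefit is that neither question requires a query fanned out across every shard.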
Dev Data

now let’s talk about development data
The Problem

hit this a few years ago, every big company probably has this issue
DATA

Tuesday, November 12, 13

sync prod to dev, until prod data gets too big
http://www.flickr.com/photos/uwwresnet/62808...
Some Approaches
subsets of data
generated data
Tuesday, November 12, 13

subsets have to end somewhere (a shop has favorit...
But...
Tuesday, November 12, 13

but there is a problem with both of those approaches
Edge Cases
Tuesday, November 12, 13

what about testing edge cases, difficult to diagnose bugs?
hard to model the same dat...
Complexity

Tuesday, November 12, 13

another issue is testing problems at scale, complex and large gobs of
data
real soci...
Copy prod data to dev ?
Tuesday, November 12, 13

what most people do before data gets too big,
almost 3 days to sync 30TB...
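
The sync estimate can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an ideal link with no protocol overhead, so it gives a lower bound rather than a measured figure.

```python
def transfer_days(terabytes, gigabits_per_second):
    """Days needed to move the data at full line rate, with no overhead."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 86400


days_at_1g = transfer_days(30, 1)         # about 2.8 days
hours_at_10g = transfer_days(30, 10) * 24  # about 6.7 hours
```

At 1 Gbps the ideal figure is just under 3 days, matching the note. At 10 Gbps the ideal figure is under 7 hours; real transfers rarely reach line rate, which may be why the talk quotes a larger number.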
instead....

Use Production
(sometimes)
Tuesday, November 12, 13

so we did what we saw as the last resort - used producti...
Tuesday, November 12, 13

goes without saying this can be dangerous, and people have to be aware
they are doing it
http://...
introducing....

dev shard
Tuesday, November 12, 13

dev shard, shard used for initial writes of data created when coming ...
tickets

shard 1

Tuesday, November 12, 13

index

shard 2

shard N
tickets

shard 1

index

shard 2

shard N
DEV shard

Tuesday, November 12, 13
Initial Writes
www.etsy.com

shard 1

www.goulah.vm

shard 2

shard N
DEV shard

Tuesday, November 12, 13
Initial Writes
www.etsy.com

shard 1

www.goulah.vm

shard 2

shard N
DEV shard

Tuesday, November 12, 13

writes from ets...
Initial Writes
www.etsy.com

shard 1

www.goulah.vm

shard 2

shard N
DEV shard

Tuesday, November 12, 13

writes from my ...
mysql proxy

proxy hits all of the shards/index/tickets
http://www.oreillynet.com/pub/a/databases/2007/07/12/...
explicitly enabled
% dev_proxy on
Dev-Proxy config is now ON. Use
'dev_proxy off' to turn it off.

Tuesday, November 12, 1...
visual notifications


notify engineers they are using the proxy,
this is read-only mode
read/write mode


read-write mode, needed for login and other things that write data
% ./bin/myscript
YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE
RUNNING A CLI SCRIPT!!!
You must type the phras...
known input/output

Tuesday, November 12, 13

we know where all of the queries from dev originate
http://www.flickr.co...
dangerous/unnecessary queries
(DEV) etsy_rw@jgoulah [test]>
select * from fred_test;
ERROR 9001 (E9001): Selects from
tabl...
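
The kind of check the proxy applies can be illustrated with a small filter. This Python sketch is not the proxy's code: the function and exception names are hypothetical, the ALTER message is invented for the example, and the keyword matching is deliberately crude (a `where` inside a string literal would fool it).

```python
import re


class RejectedQuery(Exception):
    """Raised when the filter refuses to forward a query."""


def check_query(sql):
    """Return the query unchanged if it may be forwarded, else raise."""
    statement = sql.strip().rstrip(";")
    has_where = re.search(r"\bwhere\b", statement, re.IGNORECASE) is not None
    if re.match(r"select\b", statement, re.IGNORECASE):
        reads_table = re.search(r"\bfrom\b", statement, re.IGNORECASE) is not None
        if reads_table and not has_where:
            raise RejectedQuery("Selects from tables must have where clauses")
    if re.match(r"alter\b", statement, re.IGNORECASE):
        # Hypothetical message; the talk only says ALTERs don't run from dev.
        raise RejectedQuery("ALTER statements do not run from dev")
    return sql
```

A production filter would parse the statement rather than pattern-match it, but the control flow is the same: inspect, then either forward or return an error to the client.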
logging

Tuesday, November 12, 13

the basis of anomaly detection is log collection
2013-04-22 18:05:43 485370821 devproxy --

date

thread id

/* DEVPROXY source=10.101.194.19:40198

source ip
uuid=c309e8d...
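
A log line in this shape is easy to split into fields for later analysis. The Python sketch below assumes the whole entry sits on one line with single spaces between parts; the slide shows it spread out with callouts, so the exact layout is an assumption, as are the field names.

```python
import re

LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<thread_id>\d+) devproxy -- "
    r"/\* DEVPROXY source=(?P<source>\S+) "
    r"uuid=(?P<uuid>\S+) "
    r"\[(?P<request_id>[^\]]+)\] "
    r"\[(?P<dest_shard>[^\]]+)\] "
    r"\[(?P<script>[^\]]+)\] \*/ "
    r"(?P<query>.*)"
)

SAMPLE = (
    "2013-04-22 18:05:43 485370821 devproxy -- "
    "/* DEVPROXY source=10.101.194.19:40198 "
    "uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 "
    "[htSp8458VmHlC] [etsy_index_B] [browse.php] */ "
    "SELECT id FROM table;"
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_line(SAMPLE)
```

With the source address, destination shard and script available per query, unusual traffic from dev (an unexpected script, or a shard a developer should not be touching) becomes something you can search for.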
Tuesday, November 12, 13
stealth data

Tuesday, November 12, 13

hiding data from users
(favorites go on dev and prod shard, making sure test user/...
overlays
Tuesday, November 12, 13

An overlay is a local copy of production data
If there are overlays in place in dev, it...
prod
user_id

group_id

1

A

1

B

2

A

2

dev
copy

user_id

group_id

C

3

A

3

A

3

B

3

B

3

C

3

C

store in ...
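
The overlay lookup can be sketched as a routing decision made before the usual index lookup. In this Python illustration a set stands in for the memcache entries keyed by table and primary key; `LOCAL_DB`, `record_overlay` and `route` are hypothetical names.

```python
overlay_keys = set()  # stands in for memcache entries keyed by (table, pk)
LOCAL_DB = "local"    # hypothetical name for the developer's own MySQL


def record_overlay(table, pk):
    """Called after the affected row has been copied to the local database."""
    overlay_keys.add((table, pk))


def route(table, pk, index_lookup):
    """Prefer the local overlay; otherwise fall back to the index lookup."""
    if (table, pk) in overlay_keys:
        return LOCAL_DB
    return index_lookup(pk)
```

Rows that dev has never written keep coming from production, while rows that dev has modified are served from the local copy, so production data is never altered.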
Delayed Slaves

Tuesday, November 12, 13

pt-slave-delay watches a slave and starts and stops its replication SQL thread a...
Delayed Slaves
4 hour delay behind master
produce row based binary logs
allow for quick recovery
Tuesday, November 12, 13
...
pt-slave-delay --daemonize
--pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log
--delay 4h --interval 1m --...
Shard Pair
R/W

Slave
Tuesday, November 12, 13

R/W

pt-slave-delay
row based binlogs
Shard Pair
R/W

R/W
HDFS

Slave

Parse/
Transform

Vertica

Tuesday, November 12, 13

in addition can use slaves to send d...
Schema Changes
Tuesday, November 12, 13

alters take forever, lock rows being altered
(this is why we have new things like...
shard 1

shard 2

shard N

Tuesday, November 12, 13

LOTS of servers to apply changes to PLUS the alter problem
shard 1

shard 2

Tuesday, November 12, 13

apply to a side that is inactive

shard N
Schemanator

!! explain the config push process a bit
also this is used to apply the alters
Tuesday, November 12, 13
shard 1

Tuesday, November 12, 13

shard 2

shard N
shard 1

shard 2

SET SQL_LOG_BIN = 0; ALTER TABLE user ....
Tuesday, November 12, 13

shard N

check two things in test phase:
- schema applies to blank db
- table validates against our sql s...
shard migration
Tuesday, November 12, 13

migration of data from one shard to another
Why?

Tuesday, November 12, 13

why migrate data?
Prevent disk from filling

Tuesday, November 12, 13
Prevent disk from filling
High traffic objects (shops, users)

Tuesday, November 12, 13

high traffic == disk usage and I/O u...
Prevent disk from filling
High traffic objects (shops, users)
Shard rebalancing

Tuesday, November 12, 13

rebalancing when a...
When?


users per shard
Balance

Tuesday, November 12, 13

how many users on each shard
per object migration
<object type> <object id> <shard>

# migrate_object User 5307827 2

Tuesday, November 12, 13
percentage migration
<object type> <percent> <old shard> <new shard>

# migrate_pct User 25 3 6

Tuesday, November 12, 13
index
user_id

Tuesday, November 12, 13

migration_lock

old_shard_id

1

shard 1

shard_id
1

0

0

shard 2

shard N
index
user_id

shard_id

migration_lock

old_shard_id

1

1

1

0

•Lock

shard 1

shard 2

shard N

Tuesday, November 12,...
index
user_id

shard_id

migration_lock

old_shard_id

1

1

1

0

•Lock
•Migrate

shard 1

Tuesday, November 12, 13

shar...
index
user_id

shard_id

migration_lock

old_shard_id

1

1

1

0

•Lock
•Migrate
•Checksum
shard 1

shard 2

Tuesday, Nov...
index
user_id

shard_id

migration_lock

old_shard_id

1

1

1

0

•Lock
•Migrate
•Checksum
shard 1

Tuesday, November 12,...
index
user_id

shard_id

migration_lock

old_shard_id

1

2

0

1

•Lock
•Migrate
•Checksum
•Unlock
shard 1

Tuesday, Nove...
index
user_id

shard_id

migration_lock

old_shard_id

1

2

0

1

•Lock
•Migrate
•Checksum
•Unlock
•Delete (from old shar...
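
The five steps listed on these slides (lock, migrate, checksum, unlock, delete) can be sketched as one function over the index row and the shard contents. This is a simplified Python model: dictionaries stand in for the index and shard tables, the CRC32 checksum is just one possible choice, and the function names are hypothetical.

```python
import zlib


def checksum(rows):
    """Order-independent checksum of a collection of rows."""
    return zlib.crc32(repr(sorted(rows)).encode())


def migrate_user(index, shards, user_id, new_shard):
    """Lock, migrate, checksum, unlock, delete."""
    entry = index[user_id]
    old_shard = entry["shard_id"]

    entry["migration_lock"] = 1                              # Lock
    rows = list(shards[old_shard].get(user_id, []))
    shards[new_shard][user_id] = list(rows)                  # Migrate

    if checksum(shards[new_shard][user_id]) != checksum(rows):  # Checksum
        del shards[new_shard][user_id]
        entry["migration_lock"] = 0
        raise RuntimeError("checksum mismatch, migration aborted")

    # Unlock: point the index at the new shard and remember the old one.
    entry.update(shard_id=new_shard, old_shard_id=old_shard, migration_lock=0)
    shards[old_shard].pop(user_id, None)                     # Delete from old
```

The index row ends up as on the last slide: `shard_id` 2, `migration_lock` 0, `old_shard_id` 1. Deleting from the old shard only after the index has been repointed means a failure at any earlier step leaves the original data in place.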
Logical Shards
Tuesday, November 12, 13

Writing data into the new shard, deleting data from the old shard and then optimi...
dbshard38
mysql
db_300
db_301
db_302
db_303
db_304
db_305
....

Tuesday, November 12, 13

with this, slave replication is ...
'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_001_B' => ...
Advantages
• multi threaded slave
• simpler migrations

Tuesday, November 12, 13

In MySQL 5.6 we have multi-threaded slav...
Logical Shard Migrations

Tuesday, November 12, 13

Let’s walk through a logical shard migration...
dbshard41

dbshard61

db_300
db_301
....
db_312

dbshard42

dbshard62

db_300
db_301
....
db_312
Tuesday, November 12, 13
...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_300
db_301
....
db_312

restore
backup

dbshard42

dbshard62

db_300
d...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_300
db_301
....
db_312

slave

slave
dbshard42

dbshard62

db_300
db_3...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_300
db_301
....
db_312

slave

slave
dbshard42

dbshard62

db_300
db_3...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_307

dbshard42
db_300
db_301
....
db_312

slave

config:
db_307-312
cha...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_307
....
db_312

slave
dbshard42

dbshard62

db_300
db_301
....
db_312...
dbshard41

dbshard61

db_300
db_301
....
db_312

db_307

dbshard42
db_300
db_301
....
db_312

....
db_312

config:
db_307-3...
dbshard41

dbshard61

db_300
db_301
....
db_306

db_307
....
db_312

slave
dbshard42

dbshard62

db_300
db_301
....
db_306...
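
With logical shards, the cutover in this walkthrough comes down to a config change: a range of logical databases is repointed at the new hosts once they have caught up. The Python sketch below models only that remapping. The dictionary layout, the `repoint` helper, and the pairing of dbshard41 with dbshard42 and dbshard61 with dbshard62 are assumptions drawn from the diagram, not the actual config format.

```python
# Hypothetical config: logical database name -> physical master-master pair.
config = {f"db_{n}": ("dbshard41", "dbshard42") for n in range(300, 313)}


def repoint(config, first, last, new_pair):
    """Return a new config with db_<first>..db_<last> served by new_pair."""
    moved = {f"db_{n}" for n in range(first, last + 1)}
    missing = moved - config.keys()
    if missing:
        raise KeyError(f"unknown logical shards: {sorted(missing)}")
    return {db: (new_pair if db in moved else pair) for db, pair in config.items()}


new_config = repoint(config, 307, 312, ("dbshard61", "dbshard62"))
```

No rows are copied one by one in this model; whole databases move, which is what makes the migration simpler than the per-object approach.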
Other Tools
mysqlsummary
Tuesday, November 12, 13

essentially just reformatting show processlist
% mysqlsummary.pl --host dbshard31
Details for dbshard31
==================================
COMMAND SUMMARY
==============...
HOST SUMMARY
============
meteor03
meteor01
web0228
api05
worker05
worker12

10
8
3
3
3
3

SCRIPT SUMMARY
==============
J...
COMMAND TIMINGS
===============
---------------------------------------------------------------------+ HOST: worker19, USE...
ORM REPL
Tuesday, November 12, 13
% php-repl
[1] etsy-php> EtsyORM::getFinder('User');
→ object(EtsyModel_UserFinder)(
0 => 'countAll( SELECT count(*) FROM ...
qtop
Tuesday, November 12, 13

we send queries over UDP from our ORM and stick them in a db to
analyze later
request cont...
Thank you
etsy.com/jobs
The Shard Revisited: Tools and Techniques Used at Etsy


This talk gives an overview of the architecture, then covers the development data problem. It also describes some of the tools we use for data migrations and schema changes.

Published in: Technology, Business

Transcript of "The Shard Revisited: Tools and Techniques Used at Etsy"

  1. 1. The Shard Revisited Tools and Techniques Used at Etsy jgoulah@etsy.com / @johngoulah Tuesday, November 12, 13
  2. 2. Tuesday, November 12, 13 A marketplace for people around the world to connect, buy, and sell unique goods Etsy is the marketplace that we all make together, and our mission is to re-imagine commerce in ways that build a more fulfilling and lasting world
  3. 3. 60MM+ unique visitors/mo. 1.5B+ page views / mo. 1M+ shops / 200 countries 895MM sales in 2012 Tuesday, November 12, 13
  4. 4. Tuesday, November 12, 13 this talk consists of the architecture, our dev data problem/solution, and other tools big cluster, 35 shards
  5. 5. 6TB InnoDB buffer pool 30TB+ data stored 100K+ queries/sec avg ~1.8Gbps outbound (plain text) 99.9% queries under 1ms Tuesday, November 12, 13 1/3 RAM not dedicated to the pool (OS, disk, network buffers, etc)
  6. 6. ~100 MySQL servers 1100 15K rpm disks / 1600+ CPUs Server Spec HP DL 380 G8 96GB RAM 16 spindles / 2TB RAID 10 24 Core Tuesday, November 12, 13 16 x 146GB
  7. 7. Architecture Tuesday, November 12, 13 2 key concerns when you reach scale....
  8. 8. Redundancy Tuesday, November 12, 13 the duplication of critical components of a system with the intention of increasing reliability example: jet engines
  9. 9. Master - Master R/W Tuesday, November 12, 13 duplication of critical components.... R/W
  10. 10. Master - Master R/W R/W Side A Side B Tuesday, November 12, 13 we call these sides “replicants”
  11. 11. Scalability Tuesday, November 12, 13 the ability of a system to handle a growing amount of work in a capable manner (grocery store example)
  12. 12. shard 1 shard 2 shard N ... Tuesday, November 12, 13 horizontal scaling
  13. 13. shard 1 shard 2 shard N ... shard N + 1 Tuesday, November 12, 13 horizontal scaling
  14. 14. shard 1 shard 2 shard N ... Migrate Migrate shard N + 1 Tuesday, November 12, 13 horizontal scaling Migrate
  15. 15. Bird’s-Eye View Tuesday, November 12, 13 http://www.flickr.com/photos/feuilllu/36612719/sizes/l/in/ photostream/
  16. 16. tickets shard 1 index shard 2 Tuesday, November 12, 13 3 main components couple others, dbaux, dbtasks shard N
  17. 17. tickets index Unique IDs shard 1 Tuesday, November 12, 13 shard 2 shard N
  18. 18. tickets index Shard Lookup shard 1 Tuesday, November 12, 13 shard 2 shard N
  19. 19. tickets shard 1 index shard 2 Store/Retrieve Data Tuesday, November 12, 13 shard N
  20. 20. Basics Tuesday, November 12, 13 what is sharding?
  21. 21. users_groups user_id group_id 1 A 1 B 2 A 2 C 3 A 3 B 3 C Tuesday, November 12, 13
  22. 22. users_groups user_id group_id 1 A 1 B 2 A 2 C 3 A 3 B 3 C Tuesday, November 12, 13 creating horizontal partitions from a table
  23. 23. users_groups user_id group_id 1 A 1 B 2 A user_id group_id 2 C 3 A 3 A 3 B 3 B 3 C 3 C Tuesday, November 12, 13
  24. 24. users_groups shard 1 user_id group_id 1 A 1 B 2 A user_id group_id 2 C 3 A 3 B 3 C Tuesday, November 12, 13 shard 2
  25. 25. Index Servers Tuesday, November 12, 13 have to be able to find the data, these simply exist to look up where the data is to answer the question: what shard is the data on? http://www.flickr.com/photos/mamsy/4175783446/sizes/l/in/ photostream/
  26. 26. index shard 1 shard 2 Tuesday, November 12, 13 want to find details for a user shard N
  27. 27. index shard 1 select shard_id from user_index where user_id = X shard 2 Tuesday, November 12, 13 first get the shard id, have the PK shard N
  28. 28. index select shard_id from user_index where user_id = X returns 1 shard 1 Tuesday, November 12, 13 shard 2 shard N
  29. 29. index shard 1 Tuesday, November 12, 13 select join_date from users where user_id = X shard 2 shard N
  30. 30. index select join_date from users where user_id = X returns 2012-02-05 shard 1 Tuesday, November 12, 13 shard 2 shard N
  31. 31. Ticket Servers Tuesday, November 12, 13 http://www.flickr.com/photos/rexroof/5126088323/sizes/l/in/ photostream/
  32. 32. Globally Unique ID Tuesday, November 12, 13 can’t use auto-increment with a distributed system, hand out globally unique IDs
  33. 33. CREATE TABLE `tickets` ( `id` bigint(20) unsigned NOT NULL auto_increment, `stub` char(1) NOT NULL default '', PRIMARY KEY (`id`), UNIQUE KEY `stub` (`stub`) ) ENGINE=MyISAM Tuesday, November 12, 13 only myisam tables, leverage myisam engine's lack of concurrency
  34. 34. Ticket Generation REPLACE INTO tickets (stub) VALUES ('a'); SELECT LAST_INSERT_ID(); Tuesday, November 12, 13 since value ‘a’ exists, it replaces the row with the same value (and bumps the id) if an old row in the table has the same value as a new row for a PK or a UNIQUE index, the old row is deleted before the new row is inserted
  35. 35. Ticket Generation REPLACE INTO tickets (stub) VALUES ('a'); SELECT LAST_INSERT_ID(); SELECT * FROM tickets; id 4589294 Tuesday, November 12, 13 stub a
  36. 36. tickets A auto-increment-increment = 2 auto-increment-offset = 1 tickets B auto-increment-increment = 2 auto-increment-offset = 2 Tuesday, November 12, 13 ODD:offset=1 EVEN: offset=2 http://openclipart.org/detail/94723/database-symbol-by-rg1024
  37. 37. tickets A auto-increment-increment = 2 auto-increment-offset = 1 tickets B auto-increment-increment = 2 auto-increment-offset = 2 NOT master-master Tuesday, November 12, 13 failure is ok, only lose last ticket id can bring another server up with new offset http://openclipart.org/detail/94723/database-symbol-by-rg1024
  38. 38. Shards Tuesday, November 12, 13 shards hold the majority of the data http://www.flickr.com/photos/merrickb/63999750/sizes/o/in/ photostream/
  39. 39. Object Hashing ....aka pinning data to one side of the shard Tuesday, November 12, 13 after we determine the shard we have to determine side A or side B given the replicant index also helps keep connections to a (relative) minimum since all stuff sharded by a specific instance will then pick the same side
  40. 40. A user_id : 500 Tuesday, November 12, 13 so we know the shard, now which replicant object id in this case is user_id side a/b are replicants B
  41. 41. A B user_id : 500 % (# active replicants) Tuesday, November 12, 13
  42. 42. A B 'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_A' => 'mysql:host=dbshard05.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_B' => 'mysql:host=dbshard06.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', user_id : 500 % (# active replicants) Tuesday, November 12, 13 each master master pair in the config
  43. 43. A user_id : 500 % (2) Tuesday, November 12, 13 B
  44. 44. A user_id : 500 % (2) == 0 Tuesday, November 12, 13 B
  45. 45. A user_id : 500 % (2) == 0 Tuesday, November 12, 13 B select ... insert ... update ...
  46. 46. A B user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  47. 47. 500 select ... insert ... update ... A B 501 select ... insert ... update ... user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  48. 48. Failure Tuesday, November 12, 13 http://www.flickr.com/photos/44124348109@N01/6467405231/
  49. 49. A B user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  50. 50. A B user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  51. 51. A B user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  52. 52. A B 'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_A' => 'mysql:host=dbshard05.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_B' => 'mysql:host=dbshard06.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  53. 53. A B 'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw', 'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_A' => 'mysql:host=dbshard05.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', 'etsy_shard_003_B' => 'mysql:host=dbshard06.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw', user_id : 500 % (2) == 0 user_id : 501 % (2) == 1 Tuesday, November 12, 13
  54. 54. A B user_id : 500 % (1) == 0 user_id : 501 % (1) == 0 Tuesday, November 12, 13
  55. 55. Variants Tuesday, November 12, 13 variants are mirrors of the same data in different tables http://www.flickr.com/photos/garibaldi/522196113/sizes/o/in/ photostream/
  56. 56. shard 2 shard 1 user_id group_id user_id group_id 1 A 3 A 1 B 3 B 2 A 4 A 2 C 5 C SELECT user_id FROM users_groups WHERE group_id = ‘A’ Tuesday, November 12, 13
  57. 57. shard 2 shard 1 user_id group_id user_id group_id 1 A 3 A 1 B 3 B 2 A 4 A 2 C 5 C SELECT user_id FROM users_groups WHERE group_id = ‘A’ Broken! Tuesday, November 12, 13
  58. 58. shard 2 shard 1 user_id group_id 3 A 3 B A 4 A C 5 C user_id group_id 1 A 1 B 2 2 JOIN SELECT user_id FROM users_groups WHERE group_id = ‘A’ Broken! Tuesday, November 12, 13
  59. 59. users_groups groups_users user_id group_id group_id user_id 1 A A 1 1 B A 3 2 A A 2 2 C B 3 3 A B 1 3 B C 2 3 C C 3 Tuesday, November 12, 13 mirror the data, map users to groups, groups to users
  60. 60. users_groups_index groups_users_index shard_id group_id shard_id 1 1 A 1 2 1 B 2 3 2 C 2 4 index user_id 3 D 3 separate indexes for different slices of data Tuesday, November 12, 13
  61. 61. users_groups_index groups_users_index shard_id group_id shard_id 1 1 A 1 2 1 B 2 3 2 C 2 4 index user_id 3 D 3 A B C 4 look up the groups a user is part of 4 4 Tuesday, November 12, 13 group_id 4 shard 3 user_id D
  62. 62. Dev Data Tuesday, November 12, 13 now let’s talk about development data
  63. 63. The Problem Tuesday, November 12, 13 hit this a few years ago, every big company probably has this issue
  64. 64. DATA Tuesday, November 12, 13 sync prod to dev, until prod data gets too big http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/ photostream/
  65. 65. Some Approaches subsets of data generated data Tuesday, November 12, 13 subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc) generated data can be time consuming to fake
  66. 66. But... Tuesday, November 12, 13 but there is a problem with both of those approaches
  67. 67. Edge Cases Tuesday, November 12, 13 what about testing edge cases, difficult to diagnose bugs? hard to model the same data set that produced a user facing bug http://www.flickr.com/photos/kalexanderson/6199793967/sizes/o/in/ photostream/
  68. 68. Complexity Tuesday, November 12, 13 another issue is testing problems at scale, complex and large gobs of data real social network ecosystem can be difficult to generate (favorites, follows) (activity feed, “similar items” search gives better results in prod) http://www.flickr.com/photos/doug88888/4687906267/sizes/o/in/ photostream/
  69. 69. Copy prod data to dev ? Tuesday, November 12, 13 what most people do before data gets too big, almost 3 days to sync 30TB over 1Gbps link, close to 10 hrs over 10Gbps bringing prod dataset to dev was expensive hardware/maint, keeping parity with prod, and applying schema changes would take at least as long
  70. 70. instead.... Use Production (sometimes) Tuesday, November 12, 13 so we did what we saw as the last resort - used production not for greenfield development, more for mature features and diagnosing bugs we still have a dev database but the data is sparse and unreliable
  71. 71. Tuesday, November 12, 13 goes without saying this can be dangerous, and people have to be aware they are doing it http://instagram.com/p/d8nw9aNqlt/ http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/ photostream/
  72. 72. introducing.... dev shard Tuesday, November 12, 13 dev shard, shard used for initial writes of data created when coming from dev env
  73. 73. tickets shard 1 Tuesday, November 12, 13 index shard 2 shard N
  74. 74. tickets shard 1 index shard 2 shard N DEV shard Tuesday, November 12, 13
  75. 75. Initial Writes www.etsy.com shard 1 www.goulah.vm shard 2 shard N DEV shard Tuesday, November 12, 13
  76. 76. Initial Writes www.etsy.com shard 1 www.goulah.vm shard 2 shard N DEV shard Tuesday, November 12, 13 writes from etsy.com go everywhere -except- dev shard
  77. 77. Initial Writes www.etsy.com shard 1 www.goulah.vm shard 2 shard N DEV shard Tuesday, November 12, 13 writes from my vm -only- go to dev shard
  78. 78. mysql proxy Tuesday, November 12, 13
  79. 79. Tuesday, November 12, 13 proxy hits all of the shards/index/tickets http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html
  80. 80. explicitly enabled % dev_proxy on Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off. Tuesday, November 12, 13 Not on all the time
  81. 81. visual notifications Tuesday, November 12, 13
  82. 82. Tuesday, November 12, 13 notify engineers they are using the proxy, this is read-only mode
  83. 83. read/write mode Tuesday, November 12, 13
  84. 84. Tuesday, November 12, 13 read-write mode, needed for login and other things that write data
  85. 85. % ./bin/myscript YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE RUNNING A CLI SCRIPT!!! You must type the phrase 'read write proxy' and press enter to continue... Tuesday, November 12, 13
  86. 86. known input/output Tuesday, November 12, 13 we know where all of the queries from dev originate http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/ photostream/
  87. 87. dangerous/unnecessary queries (DEV) etsy_rw@jgoulah [test]> select * from fred_test; ERROR 9001 (E9001): Selects from tables must have where clauses Tuesday, November 12, 13 -- filter dangerous queries - (queries without a WHERE) -- remove unnecessary queries - (instead of DELETE, have a flag, ALTER statements don’t run from dev)
  88. 88. logging Tuesday, November 12, 13 the basis of anomaly detection is log collection
  89. 89. 2013-04-22 18:05:43 485370821 devproxy -- date thread id /* DEVPROXY source=10.101.194.19:40198 source ip uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 unique id generated by proxy [htSp8458VmHlC] [etsy_index_B] [browse.php] */ app request id SELECT id FROM table; Tuesday, November 12, 13 dest. shard script
  90. 90. Tuesday, November 12, 13
  91. 91. stealth data Tuesday, November 12, 13 hiding data from users (favorites go on dev and prod shard, making sure test user/shops don’t show up in search) http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/ photostream/
  92. 92. overlays Tuesday, November 12, 13 An overlay is a local copy of production data If there are overlays in place in dev, it will send the queries to the local db instead (it does this by overriding looking up the shard on index, and checks for table/pk pair).
  93. 93. prod                     dev copy
      user_id  group_id        user_id  group_id
      1        A
      1        B
      2        A
      2        C
      3        A               3        A
      3        B               3        B
      3        C               3        C
      store in memcache: <table, pk>
      Any time we write to the other shards from dev, the shard migration copies the to-be-affected rows to the local mysql instance over the dev proxy and then stores the table/pk for subsequent lookup.
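The overlay lookup described in the last two slides could look roughly like this. It is a sketch under assumptions: a Python set stands in for memcache, `index_lookup` stands in for the real index query, and the class and method names are hypothetical.

```python
class OverlayRouter:
    """Route a (table, pk) lookup to the local db if an overlay exists."""

    def __init__(self, index_lookup):
        self.index_lookup = index_lookup  # callable: pk -> shard name
        self.overlays = set()             # stand-in for memcache: {(table, pk)}

    def record_overlay(self, table, pk):
        # Called after the affected rows were copied to the local db.
        self.overlays.add((table, pk))

    def route(self, table, pk):
        # The overlay check overrides the normal shard lookup on the index.
        if (table, pk) in self.overlays:
            return "local"
        return self.index_lookup(pk)
```

The first write from dev would call `record_overlay` once the copy has finished, so every later read of that table/pk pair stays on the local copy.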
  94. 94. Delayed Slaves
      pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it at least as far behind the master as requested.
      http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/
  95. 95. Delayed Slaves •4 hour delay behind master •produce row based binary logs •allow for quick recovery
      Role of the delayed slave; also a source of BCP (business continuity planning: prevention of and recovery from threats).
  96. 96. pt-slave-delay --daemonize --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log --delay 4h --interval 1m --nocontinue
      The last 3 options are the most important: --delay 4h is the delay, --interval is how frequently it should check whether the slave should be started or stopped, and --nocontinue means don’t continue replication normally on exit (don’t catch up with the master). User/pass omitted for brevity.
  97. 97. [Diagram: Shard Pair (R/W, R/W), Slave with pt-slave-delay, row based binlogs]
  98. 98. [Diagram: Shard Pair (R/W, R/W), Slave, HDFS, Parse/Transform, Vertica]
      In addition, we can use slaves to send data to other stores for offline queries:
      1) parse each binlog file to generate a sequence file of row changes
      2) apply the row changes to a previous set for the latest version
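Step 2 amounts to replaying row changes over a previous snapshot. A toy sketch of that idea, assuming each change is an `(op, pk, row)` tuple; this format is hypothetical, and the real pipeline works on parsed binlog files in HDFS rather than in-memory dicts.

```python
def apply_row_changes(previous, changes):
    """Return the latest version of a table.

    previous: dict mapping pk -> row (the earlier snapshot).
    changes:  iterable of (op, pk, row) in binlog order, where op is
              "insert", "update" or "delete" (hypothetical format).
    """
    latest = dict(previous)  # leave the earlier snapshot untouched
    for op, pk, row in changes:
        if op == "delete":
            latest.pop(pk, None)
        elif op in ("insert", "update"):
            latest[pk] = row
        else:
            raise ValueError("unknown op: %r" % (op,))
    return latest
```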
  99. 99. Schema Changes
      ALTERs take forever and lock the rows being altered (this is why we have newer things like online schema change).
  100. 100. shard 1 shard 2 shard N
      LOTS of servers to apply changes to, PLUS the alter problem.
  101. 101. shard 1 shard 2 shard N
      Apply to a side that is inactive.
  102. 102. Schemanator
  103. 103. !! Explain the config push process a bit; this is also used to apply the alters.
  105. 105. shard 1 shard 2 shard N
  106. 106. shard 1 shard 2 shard N
      SET SQL_LOG_BIN = 0;
      ALTER TABLE user ....
  108. 108. Check two things in the test phase:
      - schema applies to blank db
      - table validates against our sql standards
  109. 109. shard migration
      migration of data from one shard to another
  110. 110. Why?
      why migrate data?
  111. 111. •Prevent disk from filling
  112. 112. •Prevent disk from filling •High traffic objects (shops, users)
      high traffic == disk usage and I/O util
  113. 113. •Prevent disk from filling •High traffic objects (shops, users) •Shard rebalancing
      rebalancing when adding new shards or when shards fill unequally
  114. 114. When?
  115. 115. users per shard
  116. 116. Balance
      how many users on each shard
  117. 117. per object migration
      <object type> <object id> <shard>
      # migrate_object User 5307827 2
  118. 118. percentage migration
      <object type> <percent> <old shard> <new shard>
      # migrate_pct User 25 3 6
  119. 119. index
      user_id  shard_id  migration_lock  old_shard_id
      1        1         0               0
      shard 1 shard 2 shard N
  120. 120. index
      user_id  shard_id  migration_lock  old_shard_id
      1        1         1               0
      •Lock
      shard 1 shard 2 shard N
      explain about the lock, what happens in app, reads vs. writes
  121. 121. index
      user_id  shard_id  migration_lock  old_shard_id
      1        1         1               0
      •Lock •Migrate
      shard 1 shard 2 shard N
  122. 122. index
      user_id  shard_id  migration_lock  old_shard_id
      1        1         1               0
      •Lock •Migrate •Checksum
      shard 1 shard 2 shard N
      checksum is a count(*) on each table
  123. 123. index
      user_id  shard_id  migration_lock  old_shard_id
      1        1         1               0
      •Lock •Migrate •Checksum
      shard 1 shard 2 shard N
  124. 124. index
      user_id  shard_id  migration_lock  old_shard_id
      1        2         0               1
      •Lock •Migrate •Checksum •Unlock
      shard 1 shard 2 shard N
  125. 125. index
      user_id  shard_id  migration_lock  old_shard_id
      1        2         0               1
      •Lock •Migrate •Checksum •Unlock •Delete (from old shard)
      shard 1 shard 2 shard N
      deletes are out of band, auto-back off by looking at connection metrics
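The five steps above can be sketched with in-memory dicts standing in for the index and the shards. Everything here is illustrative: `migrate_user` and `delete_from_old_shard` are hypothetical names, each row is assumed to carry a `user_id` column, and failure handling (what happens to the lock when the checksum fails) is an open question the slides do not answer.

```python
def _count(rows, user_id):
    return sum(1 for r in rows if r["user_id"] == user_id)


def migrate_user(index, shards, user_id, new_shard, tables):
    """Lock, migrate, checksum, unlock. Returns the old shard id."""
    row = index[user_id]
    old_shard = row["shard_id"]
    row["migration_lock"] = 1                               # Lock
    for t in tables:                                        # Migrate
        moved = [dict(r) for r in shards[old_shard][t] if r["user_id"] == user_id]
        shards[new_shard][t].extend(moved)
    for t in tables:                                        # Checksum: count per table
        if _count(shards[old_shard][t], user_id) != _count(shards[new_shard][t], user_id):
            # Assumption: leave the lock in place and let an operator look.
            raise RuntimeError("checksum mismatch on table %s" % t)
    row.update(shard_id=new_shard, migration_lock=0,        # Unlock
               old_shard_id=old_shard)
    return old_shard


def delete_from_old_shard(shards, old_shard, user_id, tables):
    """Out-of-band cleanup; the real tool also backs off under load."""
    for t in tables:
        shards[old_shard][t] = [r for r in shards[old_shard][t]
                                if r["user_id"] != user_id]
```

After `migrate_user(index, shards, 1, 2, tables)` the index row for user 1 reads shard_id 2, migration_lock 0, old_shard_id 1, matching the table on slide 124.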
  126. 126. Logical Shards
      Writing data into the new shard, deleting data from the old shard, and then optimizing every single table is a large amount of work. Instead, we can run a mysql process with many databases.
  127. 127. dbshard38 (mysql): db_300 db_301 db_302 db_303 db_304 db_305 ....
      With this, slave replication is multiplied by the number of logical shards per box (assuming even distribution of writes).
  128. 128.
      'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
      'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
      'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
      'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
      'etsy_shard_100_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
      'etsy_shard_100_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
      'etsy_shard_101_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',
      'etsy_shard_101_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',
      Shards 100 and 101: same mysql instance, different database/schema.
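To make the "same instance, different schema" point concrete, here is a small sketch that parses DSN strings of the shape shown above and groups database names by host. The helper names are hypothetical and the parser only handles this simple `key=value;` layout.

```python
from collections import defaultdict


def parse_dsn(dsn):
    """Split 'mysql:host=...;port=...;dbname=...' into a dict of fields."""
    driver, _, rest = dsn.partition(":")
    fields = dict(part.split("=", 1) for part in rest.split(";") if part)
    fields["driver"] = driver
    return fields


def logical_shards_by_host(config):
    """Map each host to the list of databases (logical shards) it serves."""
    by_host = defaultdict(list)
    for name in sorted(config):
        fields = parse_dsn(config[name])
        by_host[fields["host"]].append(fields["dbname"])
    return dict(by_host)
```

Run over the A-side entries for shards 100 and 101, this returns a single host, dbshard50.ny4.etsy.com, with two databases.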
  129. 129. Advantages •multi threaded slave •simpler migrations
      In MySQL 5.6 we have the multi-threaded slave, but it can only process in parallel if we have multiple MySQL schemas (databases). The downside is that we have many more logical shards to maintain.
  130. 130. Logical Shard Migrations
      Let's walk through a logical shard migration...
  131. 131. dbshard41: db_300 db_301 .... db_312 | dbshard61: (no dbs yet)
      dbshard42: db_300 db_301 .... db_312 | dbshard62: (no dbs yet)
      Suppose dbshard41/42 have shard dbs 300 - 312 and we want to move half of them to a new shard pair (61/62).
  132. 132. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_300 db_301 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_300 db_301 .... db_312
      restore backup
      We restore last night's backup from 41 onto 61 and 62.
  133. 133. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_300 db_301 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_300 db_301 .... db_312
      slave, slave
      Set up 62 to slave from 61, and 61 to slave from 41, starting from where the backup stopped.
  134. 134. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_300 db_301 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_300 db_301 .... db_312
      slave, slave
      Once 61 and 62 are all caught up, change the config so that dbshard42 is disabled and all writes/reads go to dbshard41.
  135. 135. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_307 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_300 db_301 .... db_312
      slave, slave
      config: db_307-312 change from dbshard 41 to 61
      Then change the config for db_307 through 312 on dbshard41 to point to dbshard61.
  136. 136. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_307 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_300 db_301 .... db_312
      slave
      Reset the dbshard61 slave to point to dbshard62 instead. So now we have Master-Master going.
  137. 137. dbshard41: db_300 db_301 .... db_312 | dbshard61: db_307 .... db_312
      dbshard42: db_300 db_301 .... db_312 | dbshard62: db_307 .... db_312
      slave
      config: db_307-312 change from dbshard 42 to 62
      Change db_307 through 312 on dbshard42 in the config to point to dbshard62.
  138. 138. dbshard41: db_300 db_301 .... db_306 | dbshard61: db_307 .... db_312
      dbshard42: db_300 db_301 .... db_306 | dbshard62: db_307 .... db_312
      slave
      And we're done. Drop db_307 through db_312 on dbshard41/42 and re-enable writes on 42.
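The two config changes in this walkthrough (41 to 61 for side A, then 42 to 62 for side B) are the same operation on a mapping of logical db to host. A sketch with a plain dict as the config; the real config is the PHP array of DSNs shown earlier, and `repoint` is a hypothetical helper.

```python
def repoint(config, db_names, old_host, new_host):
    """Return a new config with db_names moved from old_host to new_host.

    config maps a logical db name to the host currently serving it.
    Raises if a db is not where we expect it, so a typo cannot silently
    move the wrong database.
    """
    new_config = dict(config)
    for db in db_names:
        if new_config.get(db) != old_host:
            raise ValueError("%s is not on %s" % (db, old_host))
        new_config[db] = new_host
    return new_config


# Side A before the migration: db_300 .. db_312 all on dbshard41.
side_a = {"db_%d" % n: "dbshard41" for n in range(300, 313)}
to_move = ["db_%d" % n for n in range(307, 313)]
side_a_after = repoint(side_a, to_move, "dbshard41", "dbshard61")
```

Returning a new dict instead of mutating in place keeps the old config around until the new one has been pushed.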
  139. 139. Other Tools
  140. 140. mysqlsummary
      essentially just reformatting show processlist
  141. 141. % mysqlsummary.pl --host dbshard31
      Details for dbshard31
      ==================================
      COMMAND SUMMARY
      ===============
      Sleep         211 (96.79%)
      Execute         2 (0.92%)
      Connect         2 (0.92%)
      Binlog Dump     2 (0.92%)
      Query           1 (0.46%)
  142. 142. HOST SUMMARY
      ============
      meteor03       10 (4.59%)
      meteor01        8 (3.67%)
      web0228         3 (1.38%)
      api05           3 (1.38%)
      worker05        3 (1.38%)
      worker12        3 (1.38%)
      SCRIPT SUMMARY
      ==============
      Job: ShopStats/calculate    1 (0.46%)
      Job: NewsFeed/refresh       1 (0.46%)
      SQL SUMMARY
      ===========
      select          1 (0.46%)
      SELECT          1 (0.46%)
      SHOW            1 (0.46%)
  143. 143. COMMAND TIMINGS
      ===============
      ---------------------------------------------------------------------
      + HOST: worker19, USER: , DB: 2, TIME: 4
      ---------------------------------------------------------------------
      select * from activity where owner_id = 7395036 and owner_type_id = 2 and deleted = 0 and creation_time >= 1382226430 and public = 1 order by creation_time desc limit 0,50
      ---------------------------------------------------------------------
      + HOST: worker27, USER: , DB: 2, TIME: 4
      ---------------------------------------------------------------------
      SELECT * FROM shop_stats WHERE shop_id = 5902046 AND currency_code = 'USD' AND sales_year = 2012 AND id != 2432609442
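Since the tool is essentially a reformatting of SHOW PROCESSLIST, the summaries boil down to counting one column and printing percentages. A sketch of that counting step, assuming the processlist rows are already fetched as dicts keyed by column name; `summarize` is a hypothetical name and this is not the mysqlsummary.pl source.

```python
from collections import Counter


def summarize(rows, key):
    """Return (value, count, percent) tuples, most frequent first.

    rows: processlist rows as dicts, e.g. {"Command": "Sleep", ...}.
    key:  the column to group by, such as "Command" or "Host".
    """
    if not rows:
        return []
    counts = Counter(row[key] for row in rows)
    total = len(rows)
    return [(value, n, round(100.0 * n / total, 2))
            for value, n in counts.most_common()]
```

Grouping by "Command" produces the COMMAND SUMMARY section; grouping by host gives the HOST SUMMARY.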
  144. 144. ORM REPL
  145. 145. % php-repl
      [1] etsy-php> EtsyORM::getFinder('User');
      → object(EtsyModel_UserFinder)(
      0 => 'countAll( SELECT count(*) FROM User )',
      1 => 'findByLoginName ( $login_name )',
      2 => 'findByEmail ( $primary_email )',
      ...
  146. 146. qtop
      We send queries over UDP from our ORM and stick them in a db to analyze later.
      Request context: request id, logged-in user id, what script is executing.
      This avoids the perf hit of the slow query log, and it's realtime across all shards because it originates from the client.
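A sketch of the client side of this idea: serialize the query plus its request context and fire it at a collector over UDP. The JSON payload, the field names, and the collector address are all assumptions; the talk does not describe the wire format qtop uses.

```python
import json
import socket
import time


def encode_query_event(sql, request_id, user_id, script, now=None):
    """Serialize one query and its request context (hypothetical format)."""
    event = {
        "ts": time.time() if now is None else now,
        "sql": sql,
        "request_id": request_id,
        "user_id": user_id,
        "script": script,
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def send_query_event(sock, addr, payload):
    """Fire and forget: a logging failure must never break the request."""
    try:
        sock.sendto(payload, addr)
        return True
    except OSError:
        return False


# Usage (collector address is an assumption):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_query_event(sock, ("127.0.0.1", 9999),
#                  encode_query_event("SELECT id FROM table", "htSp8458VmHlC", 42, "browse.php"))
```

UDP fits here because the sender never blocks waiting for the collector, at the cost of occasionally dropping an event.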
  147. 147. Thank you
      etsy.com/jobs
