The Shard Revisited: Tools and Techniques Used at Etsy

jgoulah@etsy.com / @johngoulah

Tuesday, November 12, 13


A marketplace for people around the world to connect, buy, and sell
unique goods
Etsy is the marketplace that we all make together,
and our mission is to re-imagine commerce in ways that build a more
fulfilling and lasting world

60MM+ unique visitors/mo.
1.5B+ page views / mo.
1M+ shops / 200 countries
895MM sales in 2012


this talk consists of the architecture, our dev data problem/solution, and
other tools
big cluster, 35 shards

6TB InnoDB buffer pool
30TB+ data stored
100K+ queries/sec avg
~1.8Gbps outbound (plain text)
99.9% queries under 1ms

1/3 RAM not dedicated to the pool (OS, disk, network buffers, etc)

~100 MySQL servers
1100 15K rpm disks / 1600+ CPUs
Server Spec
HP DL 380 G8
96GB RAM
16 spindles / 2TB RAID 10
24 Core

16 x 146GB

Architecture

2 key concerns when you reach scale....

Redundancy

the duplication of critical components of a system with the intention of
increasing reliability
example: jet engines

Master - Master
Side A (R/W) <-> Side B (R/W)

duplication of critical components....
we call these sides “replicants”

Scalability

the ability of a system to handle a growing amount of work in a capable
manner
(grocery store example)

shard 1, shard 2, ... shard N

horizontal scaling

shard 1, shard 2, ... shard N, shard N + 1

Migrate: data is migrated onto shard N + 1

Bird’s-Eye View


http://www.flickr.com/photos/feuilllu/36612719/sizes/l/in/photostream/

tickets: Unique IDs
index: Shard Lookup
shard 1, shard 2, ... shard N: Store/Retrieve Data

3 main components
couple others, dbaux, dbtasks

Basics

what is sharding?

users_groups

user_id  group_id
1        A
1        B
2        A
2        C
3        A
3        B
3        C

creating horizontal partitions from a table

users_groups

shard 1
user_id  group_id
1        A
1        B
2        A
2        C

shard 2
user_id  group_id
3        A
3        B
3        C

Index Servers

have to be able to find the data, these simply exist to look up where the
data is
to answer the question: what shard is the data on?
http://www.flickr.com/photos/mamsy/4175783446/sizes/l/in/photostream/

index, shard 1, shard 2, ... shard N

want to find details for a user
first get the shard id, have the PK

index:
select shard_id from user_index
where user_id = X

returns 1

shard 1:
select join_date from users
where user_id = X

returns 2012-02-05
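The two-step lookup above can be sketched in a few lines. This is only an illustration, not the code from the talk: in-memory SQLite databases stand in for the MySQL index server and shards, and the `find_join_date` helper and the sample user 42 are made up for the example.

```python
import sqlite3

# In-memory SQLite databases stand in for the MySQL index server and shards.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE user_index (user_id INTEGER PRIMARY KEY, shard_id INTEGER NOT NULL)"
)
index_db.execute("INSERT INTO user_index VALUES (42, 1)")  # hypothetical user

shards = {1: sqlite3.connect(":memory:"), 2: sqlite3.connect(":memory:")}
for conn in shards.values():
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, join_date TEXT NOT NULL)"
    )
shards[1].execute("INSERT INTO users VALUES (42, '2012-02-05')")


def find_join_date(user_id):
    """Ask the index which shard owns the user, then query only that shard."""
    row = index_db.execute(
        "SELECT shard_id FROM user_index WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None  # unknown user: there is no shard to ask
    hit = shards[row[0]].execute(
        "SELECT join_date FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    return hit[0] if hit else None


print(find_join_date(42))
```

Every lookup costs one extra query against the index, which is why the index servers only store the mapping and nothing else.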

Ticket Servers

http://www.flickr.com/photos/rexroof/5126088323/sizes/l/in/photostream/

Globally Unique ID

can't use a single table's auto-increment in a distributed system, so ticket
servers hand out globally unique IDs

CREATE TABLE `tickets` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`stub` char(1) NOT NULL default '',
PRIMARY KEY (`id`),
UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM


MyISAM tables only; this leverages the MyISAM engine's lack of concurrency

Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();


since a row with the value 'a' already exists, REPLACE swaps it for a new row
with the same value (and bumps the id):
if an old row in the table has the same value as a new row for a
PK or a UNIQUE index, the old row is deleted before the new row is
inserted
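A small runnable analogue of the ticket trick, under a loud assumption: it uses SQLite (whose `REPLACE INTO` and `AUTOINCREMENT` behave similarly for this purpose), not MySQL/MyISAM, and `cursor.lastrowid` plays the role of `LAST_INSERT_ID()`. The `next_ticket` helper is a name invented for the sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite stand-in for the MyISAM tickets table shown above.
db.execute(
    "CREATE TABLE tickets ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " stub TEXT NOT NULL UNIQUE)"
)


def next_ticket():
    """Replace the single 'a' row; the new row gets the next id."""
    cur = db.execute("REPLACE INTO tickets (stub) VALUES ('a')")
    return cur.lastrowid  # analogue of SELECT LAST_INSERT_ID()


ids = [next_ticket() for _ in range(3)]
rows = db.execute("SELECT id, stub FROM tickets").fetchall()
print(ids)   # ids keep growing
print(rows)  # the table never holds more than one row
```

The table stays one row long no matter how many ids are handed out, which keeps the ticket server tiny.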

Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
SELECT * FROM tickets;

id       stub
4589294  a

tickets A (ODD)
auto-increment-increment = 2
auto-increment-offset = 1

tickets B (EVEN)
auto-increment-offset = 2

http://openclipart.org/detail/94723/database-symbol-by-rg1024

tickets A and tickets B are NOT master-master

failure is ok, only lose last ticket id
can bring another server up with new offset
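To see why the two ticket servers never collide, here is a toy simulation of the increment/offset settings. The `TicketServer` class is hypothetical and only mimics the arithmetic; real servers get this behavior from the MySQL settings above.

```python
import itertools


class TicketServer:
    """Mimics one server's id sequence: offset, offset + increment, ..."""

    def __init__(self, offset, increment=2):
        self._ids = itertools.count(start=offset, step=increment)

    def next_id(self):
        return next(self._ids)


tickets_a = TicketServer(offset=1)  # odd ids
tickets_b = TicketServer(offset=2)  # even ids

a_ids = [tickets_a.next_id() for _ in range(4)]
b_ids = [tickets_b.next_id() for _ in range(4)]
print(a_ids, b_ids)
```

Because the sequences are disjoint by construction, the two servers need no replication or coordination between them.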

Shards

shards hold the majority of the data
http://www.flickr.com/photos/merrickb/63999750/sizes/o/in/photostream/

Object Hashing
....aka pinning data to one side of the shard

after we determine the shard we have to determine side A or side B,
given the replicant index
also helps keep connections to a (relative) minimum since all stuff
sharded by a specific instance will then pick the same side

user_id : 500

so we know the shard, now which replicant
object id in this case is user_id
side a/b are replicants

user_id : 500 % (# active replicants)

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',

each master master pair in the config

user_id : 500 % (2) == 0  ->  side A (select ... / insert ... / update ...)
user_id : 501 % (2) == 1  ->  side B (select ... / insert ... / update ...)

Failure

http://www.flickr.com/photos/44124348109@N01/6467405231/

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',

two active replicants:
user_id : 500 % (2) == 0
user_id : 501 % (2) == 1

one active replicant:
user_id : 500 % (1) == 0
user_id : 501 % (1) == 0
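The modulus rule, in normal operation and after a failure, fits in one small function. This is a sketch: `pick_replicant` is an invented name, and which side survives the failure is an assumption made for the example.

```python
def pick_replicant(object_id, active_replicants):
    """Pin an object to one side: object_id % (# active replicants)."""
    if not active_replicants:
        raise RuntimeError("no active replicants for this shard")
    return active_replicants[object_id % len(active_replicants)]


both_sides = ["A", "B"]
print(pick_replicant(500, both_sides))  # 500 % 2 == 0 -> first side
print(pick_replicant(501, both_sides))  # 501 % 2 == 1 -> second side

# Assume side A is pulled from the config: everything lands on B.
one_side = ["B"]
print(pick_replicant(500, one_side))
print(pick_replicant(501, one_side))
```

Failover needs no special code path: removing a side from the list of active replicants changes the modulus, and every object resolves to the remaining side.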

Variants

variants are mirrors of the same data in different tables
http://www.flickr.com/photos/garibaldi/522196113/sizes/o/in/photostream/

shard 1                 shard 2
user_id  group_id       user_id  group_id
1        A              3        A
1        B              3        B
2        A              4        A
2        C              5        C

SELECT user_id FROM users_groups WHERE group_id = 'A'

Broken! The rows for group A are spread across both shards.

JOIN

Broken! A join cannot reach rows that live on another shard.


users_groups            groups_users
user_id  group_id       group_id  user_id
1        A              A         1
1        B              A         3
2        A              A         2
2        C              B         3
3        A              B         1
3        B              C         2
3        C              C         3

mirror the data, map users to groups, groups to users
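A toy model of the mirrored write, assuming plain dictionaries in place of MySQL tables. The shard assignments and the helper names (`add_membership`, `groups_for_user`, `users_in_group`) are illustrative only.

```python
from collections import defaultdict

# Hypothetical index data: which shard owns each user and each group.
user_shard = {1: 1, 2: 1, 3: 2}
group_shard = {"A": 1, "B": 2, "C": 2}

# shard_id -> table name -> key -> set of values
shards = defaultdict(
    lambda: {"users_groups": defaultdict(set), "groups_users": defaultdict(set)}
)


def add_membership(user_id, group_id):
    """Write both variants: one on the user's shard, one on the group's."""
    shards[user_shard[user_id]]["users_groups"][user_id].add(group_id)
    shards[group_shard[group_id]]["groups_users"][group_id].add(user_id)


def groups_for_user(user_id):
    return sorted(shards[user_shard[user_id]]["users_groups"][user_id])


def users_in_group(group_id):
    return sorted(shards[group_shard[group_id]]["groups_users"][group_id])


for u, g in [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B"), (3, "C")]:
    add_membership(u, g)

print(groups_for_user(3))
print(users_in_group("A"))
```

Each read now touches exactly one shard, at the cost of writing every membership twice.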

index

users_groups_index      groups_users_index
user_id  shard_id       group_id  shard_id
1        1              A         1
2        1              B         2
3        2              C         2
4        3              D         3

separate indexes for different slices of data

look up the groups a user is part of:
users_groups_index maps user_id 4 to shard 3, and shard 3 holds the
users_groups rows (user_id, group_id) for user 4

Dev Data

now let's talk about development data

The Problem

hit this a few years ago, every big company probably has this issue

DATA


sync prod to dev, until prod data gets too big
http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/

Some Approaches
subsets of data
generated data

subsets have to end somewhere (a shop has favorites that are connected
to people, connected to shops, etc)
generated data can be time consuming to fake

But...

but there is a problem with both of those approaches

Edge Cases

what about testing edge cases and difficult-to-diagnose bugs?
hard to model the same data set that produced a user-facing bug
http://www.flickr.com/photos/kalexanderson/6199793967/sizes/o/in/photostream/

Complexity


another issue is testing problems at scale, complex and large gobs of
data
real social network ecosystem can be difficult to generate (favorites,
follows)
(activity feed, “similar items” search gives better results in prod)
http://www.flickr.com/photos/doug88888/4687906267/sizes/o/in/photostream/

Copy prod data to dev ?

what most people do before data gets too big,
almost 3 days to sync 30TB over a 1Gbps link, close to 10 hrs over
10Gbps
bringing the prod dataset to dev meant expensive hardware/maintenance,
keeping parity with prod, and applying schema changes that would take at least
as long
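The sync times can be sanity-checked with back-of-the-envelope arithmetic. The helper below assumes decimal units (1 TB = 10^12 bytes) and a link that is saturated with zero protocol overhead, so it gives lower bounds; real transfers run slower, which presumably accounts for the larger 10Gbps figure quoted above.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Ideal transfer time at full line rate, ignoring all overhead."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


hours_1g = transfer_hours(30, 1)
hours_10g = transfer_hours(30, 10)
print(f"1Gbps:  {hours_1g:.1f} h ({hours_1g / 24:.1f} days)")
print(f"10Gbps: {hours_10g:.1f} h")
```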

instead....

Use Production
(sometimes)

so we did what we saw as the last resort - used production
not for greenfield development, more for mature features and diagnosing bugs
we still have a dev database but the data is sparse and unreliable


goes without saying this can be dangerous, and people have to be aware
they are doing it
http://instagram.com/p/d8nw9aNqlt/
http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/

introducing....

dev shard

dev shard: a shard used for the initial writes of data created from the dev
env

tickets, index, shard 1, shard 2, ... shard N, DEV shard

Initial Writes
www.etsy.com   ->  shard 1, shard 2, ... shard N
www.goulah.vm  ->  DEV shard

writes from etsy.com go everywhere -except- dev shard
writes from my vm -only- go to dev shard

mysql proxy


proxy hits all of the shards/index/tickets
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

explicitly enabled
% dev_proxy on
Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off.


Not on all the time

visual notifications



notify engineers they are using the proxy,
this is read-only mode

read/write mode



read-write mode, needed for login and other things that write data

% ./bin/myscript
YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE RUNNING A CLI SCRIPT!!!
You must type the phrase 'read write proxy' and press enter to continue...


known input/output


we know where all of the queries from dev originate
http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/

dangerous/unnecessary queries
(DEV) etsy_rw@jgoulah [test]> select * from fred_test;
ERROR 9001 (E9001): Selects from tables must have where clauses

-- filter dangerous queries (queries without a WHERE)
-- remove unnecessary queries (instead of DELETE, have a flag; ALTER
statements don't run from dev)
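A rough sketch of the kind of rule the proxy enforces. This is not the proxy's actual code; the `check_query` function and `RejectedQuery` exception are invented here, blocking DELETE outright is an assumption, and the keyword matching is deliberately naive (it would be fooled by subqueries or by keywords inside string literals).

```python
import re


class RejectedQuery(Exception):
    """Raised when a statement from dev is not allowed through."""


def check_query(sql):
    """Return the statement if it may run, otherwise raise RejectedQuery."""
    stmt = sql.strip().rstrip(";").strip()
    words = stmt.split(None, 1)
    verb = words[0].upper() if words else ""
    if verb in ("ALTER", "DELETE"):
        raise RejectedQuery(f"{verb} statements don't run from dev")
    has_from = re.search(r"\bFROM\b", stmt, re.IGNORECASE)
    has_where = re.search(r"\bWHERE\b", stmt, re.IGNORECASE)
    if verb == "SELECT" and has_from and not has_where:
        raise RejectedQuery("Selects from tables must have where clauses")
    return stmt


print(check_query("select id from fred_test where id = 1;"))
try:
    check_query("select * from fred_test;")
except RejectedQuery as err:
    print("ERROR:", err)
```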

logging


the basis of anomaly detection is log collection

2013-04-22 18:05:43 485370821 devproxy -- /* DEVPROXY source=10.101.194.19:40198 uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 [htSp8458VmHlC] [etsy_index_B] [browse.php] */ SELECT id FROM table;

2013-04-22 18:05:43: date
485370821: thread id
source=10.101.194.19:40198: source ip
uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361: unique id generated by proxy
[htSp8458VmHlC]: app request id
[etsy_index_B]: dest. shard
[browse.php]: script
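Once the fields are known, a log line like this is easy to pull apart for anomaly detection. The parser below is a sketch that assumes the fragments on the slide join into one line separated by single spaces; the group names are chosen here, not taken from the proxy.

```python
import re

LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<thread_id>\d+) devproxy -- "
    r"/\* DEVPROXY source=(?P<source>\S+) "
    r"uuid=(?P<uuid>\S+) "
    r"\[(?P<request_id>[^\]]*)\] "
    r"\[(?P<dest_shard>[^\]]*)\] "
    r"\[(?P<script>[^\]]*)\] \*/ "
    r"(?P<query>.*)"
)


def parse_line(line):
    """Return the named fields of one dev proxy log line, or None."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


sample = (
    "2013-04-22 18:05:43 485370821 devproxy -- "
    "/* DEVPROXY source=10.101.194.19:40198 "
    "uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 "
    "[htSp8458VmHlC] [etsy_index_B] [browse.php] */ "
    "SELECT id FROM table;"
)
fields = parse_line(sample)
print(fields["dest_shard"], fields["script"], fields["query"])
```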

stealth data


hiding data from users
(favorites go on dev and prod shard, making sure test user/shops don’t
show up in search)
http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/

overlays

An overlay is a local copy of production data.
If there are overlays in place in dev, queries are sent to the local db
instead
(the overlay overrides the shard lookup on the index and checks for a table/pk
pair).

prod
user_id  group_id
1        A
1        B
2        A
2        C
3        A
3        B
3        C

dev copy
user_id  group_id
3        A
3        B
3        C

store in memcache: <table, pk>

Any time we write to the other shards from dev,
the shard migration copies the rows that are about to be affected to the local
mysql instance over the dev proxy
and then stores the table/pk for subsequent lookup
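The overlay flow can be modeled with dictionaries. Everything here is a stand-in: `prod_rows` plays the production shards, `local_rows` the local MySQL instance, `overlay_keys` the memcache entries, and the two helper functions are hypothetical names.

```python
# (table, pk) -> rows; a stand-in for data living on the production shards.
prod_rows = {
    ("users_groups", 1): [(1, "A"), (1, "B")],
    ("users_groups", 3): [(3, "A"), (3, "B"), (3, "C")],
}
local_rows = {}       # stand-in for the developer's local MySQL instance
overlay_keys = set()  # stand-in for the <table, pk> entries in memcache


def write_from_dev(table, pk, row):
    """Copy the affected rows locally on first write, then write locally."""
    key = (table, pk)
    if key not in overlay_keys:
        local_rows[key] = list(prod_rows.get(key, []))
        overlay_keys.add(key)
    local_rows[key].append(row)


def read_from_dev(table, pk):
    """Serve overlaid keys from the local copy, everything else from prod."""
    key = (table, pk)
    if key in overlay_keys:
        return local_rows[key]
    return prod_rows.get(key, [])


write_from_dev("users_groups", 3, (3, "D"))
print(read_from_dev("users_groups", 3))  # local copy, includes the new row
print(read_from_dev("users_groups", 1))  # untouched, still read from prod
```

Production rows are never modified: the write lands on the copy, and later reads for that table/pk pair follow it there.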

Delayed Slaves


pt-slave-delay watches a slave and starts and stops its replication SQL thread as
necessary to hold it at least as far behind the master as you request

http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/

Delayed Slaves
4 hour delay behind master
produce row based binary logs
allow for quick recovery

role of the delayed slave
also source of BCP
(business continuity planning - prevention and recovery of threats)

pt-slave-delay --daemonize \
  --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log \
  --delay 4h --interval 1m --nocontinue


last 3 options most important:
4h delay; interval is how frequently it should check whether the slave
should be started or stopped
nocontinue - don't continue replication normally on exit (don't catch
up with master)
user/pass omitted for brevity

Shard Pair (R/W <-> R/W) -> Slave
pt-slave-delay
row based binlogs

Shard Pair (R/W <-> R/W) -> Slave -> HDFS -> Parse/Transform -> Vertica

in addition can use slaves to send data to other stores for offline queries
1) parse each binlog file to generate sequence file of row changes
2) apply the row changes to a previous set for the latest version
Schema Changes

alters take forever, lock rows being altered
(this is why we have new things like online schema change)

shard 1, shard 2, ... shard N

LOTS of servers to apply changes to PLUS the alter problem
apply to a side that is inactive

Schemanator

!! explain the config push process a bit
also this is used to apply the alters

SET SQL_LOG_BIN = 0; ALTER TABLE user ....

check two things in test phase:
- schema applies to blank db
- table validates against our sql standards

shard migration

migration of data from one shard to another

Why?


why migrate data?

Prevent disk from filling
High traffic objects (shops, users)
Shard rebalancing

high traffic == disk usage and I/O util
rebalancing when adding new shards or shards fill unequally

When?



users per shard

Balance


how many users on each shard

per object migration
<object type> <object id> <shard>

# migrate_object User 5307827 2


percentage migration
<object type> <percent> <old shard> <new shard>

# migrate_pct User 25 3 6


index, shard 1, shard 2, ... shard N

index (before)
user_id  shard_id  migration_lock  old_shard_id
1        1         0               0

• Lock
user_id  shard_id  migration_lock  old_shard_id
1        1         1               0

explain about the lock, what happens in app, reads vs. writes

• Migrate
• Checksum

checksum is a count(*) on each table

• Unlock
user_id  shard_id  migration_lock  old_shard_id
1        2         0               1

• Delete (from old shard)

deletes are out of band, auto-back off by looking at connection metrics
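The five steps can be walked through end to end in a small sketch. It uses in-memory SQLite databases instead of MySQL, a single `users_groups` table, and an inline delete (the notes say the real deletes run out of band); `migrate_user` and `count_rows` are names invented for the example.

```python
import sqlite3

index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE user_index (user_id INTEGER PRIMARY KEY, shard_id INTEGER,"
    " migration_lock INTEGER DEFAULT 0, old_shard_id INTEGER DEFAULT 0)"
)
index_db.execute("INSERT INTO user_index (user_id, shard_id) VALUES (1, 1)")

shards = {n: sqlite3.connect(":memory:") for n in (1, 2)}
for conn in shards.values():
    conn.execute("CREATE TABLE users_groups (user_id INTEGER, group_id TEXT)")
shards[1].executemany("INSERT INTO users_groups VALUES (?, ?)", [(1, "A"), (1, "B")])


def count_rows(conn, user_id):
    sql = "SELECT COUNT(*) FROM users_groups WHERE user_id = ?"
    return conn.execute(sql, (user_id,)).fetchone()[0]


def migrate_user(user_id, new_shard):
    (old_shard,) = index_db.execute(
        "SELECT shard_id FROM user_index WHERE user_id = ?", (user_id,)
    ).fetchone()
    src, dst = shards[old_shard], shards[new_shard]
    # Lock: flag the object in the index while its rows are in flight.
    index_db.execute(
        "UPDATE user_index SET migration_lock = 1 WHERE user_id = ?", (user_id,)
    )
    # Migrate: copy the object's rows to the new shard.
    rows = src.execute(
        "SELECT user_id, group_id FROM users_groups WHERE user_id = ?", (user_id,)
    ).fetchall()
    dst.executemany("INSERT INTO users_groups VALUES (?, ?)", rows)
    # Checksum: compare count(*) on both sides before switching over.
    if count_rows(src, user_id) != count_rows(dst, user_id):
        raise RuntimeError("row counts differ; leaving the lock in place")
    # Unlock: point the index at the new shard and remember the old one.
    index_db.execute(
        "UPDATE user_index SET shard_id = ?, migration_lock = 0, old_shard_id = ?"
        " WHERE user_id = ?",
        (new_shard, old_shard, user_id),
    )
    # Delete (from old shard): done inline here for simplicity.
    src.execute("DELETE FROM users_groups WHERE user_id = ?", (user_id,))


migrate_user(1, 2)
index_row = index_db.execute("SELECT * FROM user_index").fetchone()
print(index_row)
```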

Logical Shards

Writing data into the new shard, deleting data from the old shard, and then optimizing
every single table is a large amount of work.
Instead, we can run one mysql process with many databases

dbshard38
mysql
db_300
db_301
db_302
db_303
db_304
db_305
....


with this, slave replication parallelism is multiplied by the number of logical shards
per box
(assuming even distribution of writes)

'etsy_shard_100_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
'etsy_shard_100_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
'etsy_shard_101_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',
'etsy_shard_101_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',

same mysql instance, different database/schema

Advantages
• multi threaded slave
• simpler migrations

MySQL 5.6 has a multi-threaded slave,
but it can only process in parallel across multiple MySQL schemas
(databases).
The downside is that we have many more logical shards to maintain

Logical Shard Migrations


Let's walk through a logical shard migration...

dbshard41 / dbshard42: db_300, db_301, .... db_312
dbshard61 / dbshard62: new pair

Suppose dbshard 41/42 have shard dbs 300 - 312 and we want to move half of
them to a new shard pair (61/62)

1) We restore last night's backup from 41 onto 61 and 62.
2) Set up 62 to slave from 61, and 61 to slave from 41, starting from where the
backup stopped.
3) Once 61 and 62 are all caught up, change the config so that dbshard42 is
disabled and all writes/reads go to dbshard41.
4) Then change the config for db_307 through db_312 on dbshard41 to point to
dbshard61.
5) Reset the dbshard61 slave to point to dbshard62 instead. So now we have
Master-Master going.
6) Change db_307 through db_312 on dbshard42 in the config to point to dbshard62.
7) And we're done. Drop db_307 through db_312 on dbshard41/42, re-enable
writes on 42.

Result:
dbshard41 / dbshard42: db_300, db_301, .... db_306
dbshard61 / dbshard62: db_307, .... db_312
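The config side of the move is just repointing a range of logical shards at new hosts. The sketch below builds entries in the same DSN format as the config shown earlier; the `dsn` and `repoint` helpers and the `etsy_shard_300` to `etsy_shard_312` key names are assumptions made for illustration.

```python
def dsn(host, dbname):
    """Build a DSN string in the format used by the config shown earlier."""
    return f"mysql:host={host}.ny4.etsy.com;port=3306;dbname={dbname};user=etsy_rw"


# Starting point: logical shards 300-312 all live on dbshard41 (A) / dbshard42 (B).
config = {}
for n in range(300, 313):
    config[f"etsy_shard_{n}_A"] = dsn("dbshard41", f"etsy_shard_{n}")
    config[f"etsy_shard_{n}_B"] = dsn("dbshard42", f"etsy_shard_{n}")


def repoint(current, shard_numbers, side, new_host):
    """Return a copy of the config with one side of some shards moved."""
    updated = dict(current)
    for n in shard_numbers:
        updated[f"etsy_shard_{n}_{side}"] = dsn(new_host, f"etsy_shard_{n}")
    return updated


moved = range(307, 313)
config = repoint(config, moved, "A", "dbshard61")  # side A moves first
config = repoint(config, moved, "B", "dbshard62")  # then side B

print(config["etsy_shard_306_A"])
print(config["etsy_shard_307_A"])
```

No rows are copied at cutover time, since the new pair is already replicating; only the host in the connection string changes.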

Other Tools

mysqlsummary

essentially just reformatting show processlist

% mysqlsummary.pl --host dbshard31
Details for dbshard31
==================================
COMMAND SUMMARY
===============
Sleep         211 (96.79%)
Execute         2 (0.92%)
Connect         2 (0.92%)
Binlog Dump     2 (0.92%)
Query           1 (0.46%)

HOST SUMMARY
============
meteor03       10 (4.59%)
meteor01        8 (3.67%)
web0228         3 (1.38%)
api05           3 (1.38%)
worker05        3 (1.38%)
worker12        3 (1.38%)

SCRIPT SUMMARY
==============
Job: ShopStats/calculate    1 (0.46%)
Job: NewsFeed/refresh       1 (0.46%)

SQL SUMMARY
===========
select          1 (0.46%)
SELECT          1 (0.46%)
SHOW            1 (0.46%)

COMMAND TIMINGS
===============
---------------------------------------------------------------------
+ HOST: worker19, USER: , DB: 2, TIME: 4
---------------------------------------------------------------------
select * from activity where owner_id = 7395036 and owner_type_id = 2
and deleted = 0 and creation_time >= 1382226430 and public = 1 order
by creation_time desc limit 0,50
---------------------------------------------------------------------
+ HOST: worker27, USER: , DB: 2, TIME: 4
---------------------------------------------------------------------
SELECT * FROM shop_stats WHERE shop_id = 5902046 AND currency_code =
'USD' AND sales_year = 2012 AND id != 2432609442

ORM REPL

% php-repl
[1] etsy-php> EtsyORM::getFinder('User');
→ object(EtsyModel_UserFinder)(
0 => 'countAll( SELECT count(*) FROM User )',
1 => 'findByLoginName ( $login_name )',
2 => 'findByEmail ( $primary_email )',
...


qtop

we send queries over UDP from our ORM, store them in a db, and
analyze them later
request context: request id, logged in user-id, what script is executing
avoids the perf hit of the slow query log, and it's realtime across all shards
because it originates from the client
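A fire-and-forget UDP sender along these lines might look as follows. The JSON wire format, the field names, and the loopback collector are all assumptions for the sketch; the talk does not describe qtop's actual protocol.

```python
import json
import socket
import time


def encode_query_event(sql, request_id, user_id, script):
    """Serialize one query plus its request context (format is assumed)."""
    event = {
        "ts": time.time(),
        "sql": sql,
        "request_id": request_id,
        "user_id": user_id,
        "script": script,
    }
    return json.dumps(event).encode("utf-8")


def send_query_event(sock, collector_addr, sql, request_id, user_id, script):
    """Send and move on: a lost or refused datagram must not slow the app."""
    payload = encode_query_event(sql, request_id, user_id, script)
    try:
        sock.sendto(payload, collector_addr)
    except OSError:
        pass


# Loopback demo: a local socket plays the collector.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

send_query_event(
    sender, receiver.getsockname(),
    "SELECT id FROM table", "htSp8458VmHlC", 500, "browse.php",
)
event = json.loads(receiver.recv(65535))
print(event["script"], event["sql"])
sender.close()
receiver.close()
```

UDP fits here because the application never waits on the collector; dropping an occasional event is an acceptable price for keeping query logging off the request path.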

Thank you
etsy.com/jobs
