http://danga.com/words/
LiveJournal: Behind The Scenes
Scaling Storytime
June 2007
USENIX
Brad Fitzpatrick
brad@danga.com
...
http://danga.com/words/
The plan...
 Refer to previous presentations for more
details...
 http://danga.com/words/
 Ques...
http://danga.com/words/
LiveJournal Backend: Today
(Roughly.)
User DB Cluster 1
uc1a uc1b
User DB Cluster 2
uc2a uc2b
User...
http://danga.com/words/
LiveJournal Overview
 college hobby project, Apr 1999
 4-in-1:
− blogging
− forums
− social-netw...
http://danga.com/words/
 memcached
− distributed caching
 MogileFS
− distributed filesystem
 Perlbal
− HTTP load balanc...
http://danga.com/words/
“Uh, why?”
 NIH? (Not Invented Here?)
 Are we reinventing the wheel?
6
http://danga.com/words/
Yes.
 We build wheels.
− ... when existing suck,
− ... or don’t exist.
7
http://danga.com/words/
Yes.
 We build wheels.
− ... when existing suck,
− ... or don’t exist.
7
http://danga.com/words/
Yes.
 We build wheels.
− ... when existing suck,
− ... or don’t exist.
7
http://danga.com/words/
Yes.
 We build wheels.
− ... when existing suck,
− ... or don’t exist.
(yes, arguably tires. sshh...
http://danga.com/words/
Part I
Quick Scaling History
8
http://danga.com/words/
Quick Scaling History
 1 server to hundreds...
 you can do all this with just 1 server!
− then y...
http://danga.com/words/
Terminology
 Scaling:
− NOT: “How fast?”
− But: “When you add twice as many servers, are you
twic...
http://danga.com/words/
Terminology
 “Cluster”
− varying definitions... basically:
− making a bunch of computers work tog...
http://danga.com/words/
LB vs. HA
Load Balancing High Availability
12
http://danga.com/words/
LB vs. HA
Load
Balancing
High
Availability
http
reverse proxy,
wackamole,
...
round-robin DNS,
dat...
http://danga.com/words/
Favorite Venn Diagram
Times When
I’m Truly Happy
Times When I’m
Wearing Pants
14
http://danga.com/words/
One Server
 Simple:
mysql
apache
15
http://danga.com/words/
Two Servers
mysql
apache
16
http://danga.com/words/
Two Servers - Problems
 Two single points of failure!
 No hot or cold spares
 Site gets slow ag...
http://danga.com/words/
Four Servers
 3 webs, 1 db
 Now we need to load-balance!
 LVS, mod_backhand, whackamole, BIG-IP...
http://danga.com/words/
Four Servers - Problems
 Now I/O bound...
 ... how to use another database?
−
19
http://danga.com/words/
Five Servers
introducing MySQL replication
 We buy a new DB
 MySQL replication
 Writes to DB (m...
http://danga.com/words/
More Servers
Chaos!
21
http://danga.com/words/
Where we're at....
mod_perl
web4
web3
web2
web12
...
web1
BIG-IP
bigip2
bigip1
mod_proxy
proxy3
pr...
http://danga.com/words/
Problems with Architecture
or,
“This don't scale...”
 DB master is SPOF
 Adding slaves doesn't s...
http://danga.com/words/
Eventually...
 databases eventual only writing
400 write/s
3 reads/s
400
write/s
3 r/s
400 write/...
http://danga.com/words/
Spreading Writes
 Our database machines already did RAID
 We did backups
 So why put user data ...
http://danga.com/words/
Partition your data!
 Spread your databases out, into “roles”
− roles that you never need to join...
http://danga.com/words/
User Clusters
27
http://danga.com/words/
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
27
http://danga.com/words/
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
userid: 839
clusterid: 2
27
http://danga.com/words/
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
userid: 839
clusterid: 2
SELECT ...
http://danga.com/words/
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
userid: 839
clusterid: 2
SELECT ...
http://danga.com/words/
Details
 per-user numberspaces
− don't use AUTO_INCREMENT
− PRIMARY KEY (user_id, thing_id)
− so:...
http://danga.com/words/
Shared Storage
(SAN, SCSI, DRBD...)
 Turn pair of InnoDB machines into a cluster
− looks like 1 b...
http://danga.com/words/
Shared Storage: DRBD
 Linux block device driver
− “Network RAID 1”
− Shared storage without shari...
http://danga.com/words/
MySQL Clustering Options:
Pros & Cons
 No magic bullet...
− Master/Slave
 doesn’t scale with wri...
http://danga.com/words/
Part II
Our Software
32
http://danga.com/words/
Caching
 caching's key to performance
− store result of a computation or I/O for quicker future
a...
http://danga.com/words/
memcached
http://www.danga.com/memcached/
 our Open Source, distributed caching system
 implemen...
http://danga.com/words/
Protocol Commands
 set, add, replace
 delete
 incr, decr
− atomic, returning new value
35
http://danga.com/words/
Picture
36
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
36
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
36
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
0 1 2 3
36
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
36
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
$val = $clie...
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
$val = $clie...
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
$val = $clie...
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
$val = $clie...
http://danga.com/words/
Picture
10.0.0.100:11211
1GB
10.0.0.101:11211
2GB
10.0.0.102:11211
1GB
Client
0 1 2 3
$val = $clie...
http://danga.com/words/
Client hashing onto a memcacached
node
 Up to client how to pick a memcached node
 Traditional w...
http://danga.com/words/
memcached internals
 libevent
− epoll, kqueue...
 event-based, non-blocking design
− optional mu...
http://danga.com/words/
Perlbal
39
http://danga.com/words/
Web Load Balancing
 BIG-IP, Alteon, Juniper, Foundry
− good for L4 or minimal L7
− not tricky / f...
http://danga.com/words/
Perlbal
 Perl
 parts optionally in C with plugins
 single threaded, async event-based
− uses ep...
http://danga.com/words/
Perlbal: Persistent Connections
42
http://danga.com/words/
Perlbal: Persistent Connections
 perlbal to backends (mod_perls)
− know exactly when a connection...
http://danga.com/words/
Perlbal: Persistent Connections
 perlbal to backends (mod_perls)
− know exactly when a connection...
http://danga.com/words/
Perlbal: Persistent Connections
 perlbal to backends (mod_perls)
− know exactly when a connection...
http://danga.com/words/
Perlbal: Persistent Connections
 perlbal to backends (mod_perls)
− know exactly when a connection...
http://danga.com/words/
Perlbal: can verify new backend
connections
 connects to backends are often fast, but...
 are yo...
http://danga.com/words/
Perlbal: multiple queues
 high, normal, low priority queues
 paid users -> high queue
 bots/spi...
http://danga.com/words/
Perlbal: cooperative large file serving
 large file serving w/ mod_perl bad...
− mod_perl has bet...
http://danga.com/words/
Perlbal: cooperative large file serving
 internal redirects
− mod_perl can pass off serving a big...
http://danga.com/words/
Internal redirect picture
47
http://danga.com/words/
And the reverse...
 Now Perlbal can buffer uploads as well..
− Problems:
 LifeBlog uploading
−ce...
http://danga.com/words/
Palette Altering GIF/PNGs
 based on palette indexes, colors in URL,
dynamically alter GIF/PNG pal...
http://danga.com/words/
MogileFS
50
http://danga.com/words/
oMgFileS
51
http://danga.com/words/
MogileFS
 our distributed file system
 open source
 userspace
 based all around HTTP (NFS supp...
http://danga.com/words/
MogileFS: Why
 alternatives at time were either:
− closed, non-existent, expensive, in developmen...
http://danga.com/words/
MogileFS: Main Ideas
 files belong to classes,
which dictate:
− replication policy, min
replicas,...
http://danga.com/words/
MogileFS components
 clients
 mogilefsd (does all real work)
 database(s) (MySQL, .... abstract...
http://danga.com/words/
MogileFS: Clients
 tiny text-based protocol
 Libraries available for:
− Perl
 tied filehandles
...
http://danga.com/words/
MogileFS: Tracker
(mogilefsd)
 The Meat
 event-based message bus
 load balances client requests...
http://danga.com/words/
Trackers' Database(s)
 Abstract as of Mogile 2.x
− MySQL
− SQLite (joke/demo)
− Pg/Oracle coming ...
http://danga.com/words/
MogileFS storage nodes
(mogstored)
 HTTP transport
− GET
− PUT
− DELETE
 mogstored listens on 2 ...
http://danga.com/words/
Large file
GET
request
60
http://danga.com/words/
Large file
GET
request
Auth: complex, but quick
60
http://danga.com/words/
Large file
GET
request
Auth: complex, but quick
Spoonfeeding:
slow, but event-
based
60
http://danga.com/words/
Gearman
61
http://danga.com/words/
manaGer
62
http://danga.com/words/
Manager
dispatches work,
but doesn't do anything useful itself. :)
63
http://danga.com/words/
Gearman
 system to load balance function calls...
 scatter/gather bunch of calls in parallel,
 ...
http://danga.com/words/
Gearman Pieces
 gearmand
− the function call router
− event-loop (epoll, kqueue, etc)
 workers.
...
http://danga.com/words/
Gearman Picture
66
http://danga.com/words/
Gearman Picture
gearmand gearmand gearmand
66
http://danga.com/words/
Gearman Picture
Worker Worker
gearmand gearmand gearmand
66
http://danga.com/words/
Gearman Picture
Worker Worker
gearmand gearmand gearmand
can_do(“funcA”)
can_do(“funcA”)
can_do(“f...
http://danga.com/words/
Gearman Picture
Client Worker Worker
gearmand gearmand gearmand
can_do(“funcA”)
can_do(“funcA”)
ca...
http://danga.com/words/
Gearman Picture
Client Worker Worker
gearmand gearmand gearmand
call(“funcA”)
can_do(“funcA”)
can_...
http://danga.com/words/
Gearman Picture
Client Client Worker Worker
gearmand gearmand gearmand
call(“funcA”)
can_do(“funcA...
http://danga.com/words/
Gearman Picture
Client Client Worker Worker
gearmand gearmand gearmand
call(“funcA”)
call(“funcB”)...
http://danga.com/words/
Gearman Protocol
 efficient binary protocol
 No XML
 but also line-based text protocol for admi...
http://danga.com/words/
Gearman Uses
 Image::Magick outside of your mod_perls!
 DBI connection pooling (DBD::Gofer +
Gea...
http://danga.com/words/
Gearman Uses, cont..
 running code in parallel
− query ten databases at once
 running blocking c...
http://danga.com/words/
Gearman Misc
 Guarantees:
− none! hah! :)
 please wait for your results.
 if client goes away, ...
http://danga.com/words/
Sick Gearman Demo
 Don’t actually use it like this... but:
use strict;
use DMap qw(dmap);
DMap->s...
http://danga.com/words/
Gearman Summary
 Gearman is sexy.
− especially the coalescing
 Check it out!
− it's kinda our li...
http://danga.com/words/
TheSchwartz
73
http://danga.com/words/
TheSchwartz
 Like gearman:
− job queuing system
− opaque function name
− opaque “args” blob
− cli...
http://danga.com/words/
TheSchwartz Primitives
 insert job
 “grab” job (atomic grab)
− for 'n' seconds.
 mark job done
...
http://danga.com/words/
TheSchwartz
 backing store:
− a database
− uses Data::ObjectDriver
 MySQL,
 Postgres,
 SQLite,...
http://danga.com/words/
TheSchwartz uses
 outgoing email (SMTP client)
− millions of emails per day
− TheSchwartz::Worker...
http://danga.com/words/
gearmand + TheSchwartz
 gearmand: not reliable, low-latency, no disks
 TheSchwartz: latency, rel...
http://danga.com/words/
djabberd
79
http://danga.com/words/
djabberd
 Our Jabber/XMPP server
 powers our “LJ Talk” service
 S2S: works with GoogleTalk, etc...
http://danga.com/words/
djabberd hooks
 everything is a hook
− not just auth! like, everything.
− auth,
− roster,
− vcard...
http://danga.com/words/
Thank you!
Questions to:
brad@danga.com
Software:
http://danga.com/
http://code.sixapart.com/
82
http://danga.com/words/
Questions?
User DB Cluster 1
uc1a uc1b
User DB Cluster 2
uc2a uc2b
User DB Cluster 3
uc3a uc3b
Use...
http://danga.com/words/
Bonus Slides
 if extra time
84
http://danga.com/words/
Data Integrity
 Databases depend on fsync()
− but databases can't send raw SCSI/ATA commands
to f...
http://danga.com/words/
Persistent Connection Woes
 connections == threads == memory
− My pet peeve:
 want connection/th...
Upcoming SlideShare
Loading in...5
×

usenix

1,407

Published on

Published in: Automotive, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,407
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
12
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

usenix

  1. 1. http://danga.com/words/ LiveJournal: Behind The Scenes Scaling Storytime June 2007 USENIX Brad Fitzpatrick brad@danga.com danga.com / livejournal.com / sixapart.com This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA. 1
  2. 2. http://danga.com/words/ The plan...  Refer to previous presentations for more details...  http://danga.com/words/  Questions anytime! Yell. Interrupt.  Part 0: − show where talk will end up  Part I: − What is LiveJournal? Quick history. − LJ’s scaling history  Part II: − explain all our software, − explain all the moving parts 2
  3. 3. http://danga.com/words/ LiveJournal Backend: Today (Roughly.) User DB Cluster 1 uc1a uc1b User DB Cluster 2 uc2a uc2b User DB Cluster 3 uc3a uc3b User DB Cluster N ucNa ucNb Job Queues (xN) jqNa jqNb Memcached mc4 mc3 mc2 mcN ... mc1 mod_perl web4 web3 web2 webN ... web1 BIG-IP bigip2 bigip1 perlbal (httpd/proxy) proxy4 proxy3 proxy2 proxy5 proxy1 Global Database slave1 master_a master_b slave2 ... slave5 MogileFS Database mog_a mog_b Mogile Trackers tracker3tracker1 Mogile Storage Nodes ... sto2 sto8 sto1 net. djabberd djabberd djabberd gearmand gearmand1 gearmandN “workers” gearwrkN theschwkN slave1 slaveN 3
  4. 4. http://danga.com/words/ LiveJournal Overview  college hobby project, Apr 1999  4-in-1: − blogging − forums − social-networking (“friends”) − aggregator: “friends page” − “friends” can be external RSS/Atom  10M+ accounts  Open Source! − server, − infrastructure, − original clients, 4
  5. 5. http://danga.com/words/  memcached − distributed caching  MogileFS − distributed filesystem  Perlbal − HTTP load balancer, web server, swiss-army knife  gearman − LB/HA/coalescing low- latency function call “router”  TheSchwartz − reliable, async job dispatch system  djabberd − the super-extensible everything-is-a-plugin mod_perl/qpsmtpd/ Eclipse of XMPP/Jabber servers  .....  OpenID  federated identity protocol Stuff we've built... (all production, open source) 5
  6. 6. http://danga.com/words/ “Uh, why?”  NIH? (Not Invented Here?)  Are we reinventing the wheel? 6
  7. 7. http://danga.com/words/ Yes.  We build wheels. − ... when existing suck, − ... or don’t exist. 7
  8. 8. http://danga.com/words/ Yes.  We build wheels. − ... when existing suck, − ... or don’t exist. 7
  9. 9. http://danga.com/words/ Yes.  We build wheels. − ... when existing suck, − ... or don’t exist. 7
  10. 10. http://danga.com/words/ Yes.  We build wheels. − ... when existing suck, − ... or don’t exist. (yes, arguably tires. sshh..) 7
  11. 11. http://danga.com/words/ Part I Quick Scaling History 8
  12. 12. http://danga.com/words/ Quick Scaling History  1 server to hundreds...  you can do all this with just 1 server! − then you’re ready for tons of servers, without pain − don’t repeat our scaling mistakes 9
  13. 13. http://danga.com/words/ Terminology  Scaling: − NOT: “How fast?” − But: “When you add twice as many servers, are you twice as fast (or have twice the capacity)?”  Fast still matters, − 2x faster: 50 servers instead of 100...  that’s some good money − but that’s not what scaling is. 10
  14. 14. http://danga.com/words/ Terminology  “Cluster” − varying definitions... basically: − making a bunch of computers work together for some purpose − what purpose?  load balancing (LB),  high availablility (HA)  Load Balancing?  High Availability?  Venn Diagram time! − I love Venn Diagrams 11
  15. 15. http://danga.com/words/ LB vs. HA Load Balancing High Availability 12
  16. 16. http://danga.com/words/ LB vs. HA Load Balancing High Availability http reverse proxy, wackamole, ... round-robin DNS, data partitioning, .... LVS heartbeat, cold/warm/hot spare, ... 13
  17. 17. http://danga.com/words/ Favorite Venn Diagram Times When I’m Truly Happy Times When I’m Wearing Pants 14
  18. 18. http://danga.com/words/ One Server  Simple: mysql apache 15
  19. 19. http://danga.com/words/ Two Servers mysql apache 16
  20. 20. http://danga.com/words/ Two Servers - Problems  Two single points of failure!  No hot or cold spares  Site gets slow again. − CPU-bound on web node − need more web nodes... 17
  21. 21. http://danga.com/words/ Four Servers  3 webs, 1 db  Now we need to load-balance!  LVS, mod_backhand, whackamole, BIG-IP, Alteon, pound, Perlbal, etc, etc.. − ... 18
  22. 22. http://danga.com/words/ Four Servers - Problems  Now I/O bound...  ... how to use another database? − 19
  23. 23. http://danga.com/words/ Five Servers introducing MySQL replication  We buy a new DB  MySQL replication  Writes to DB (master)  Reads from both 20
  24. 24. http://danga.com/words/ More Servers Chaos! 21
  25. 25. http://danga.com/words/ Where we're at.... mod_perl web4 web3 web2 web12 ... web1 BIG-IP bigip2 bigip1 mod_proxy proxy3 proxy2 proxy1 Global Database slave1 slave2 ... slave6 master net. 22
  26. 26. http://danga.com/words/ Problems with Architecture or, “This don't scale...”  DB master is SPOF  Adding slaves doesn't scale well... − only spreads reads, not writes! 200 writes/s 200 write/s 500 reads/s 250 reads/s 200 write/s 250 reads/s 23
  27. 27. http://danga.com/words/ Eventually...  databases eventual only writing 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 400 write/s 3 reads/s 400 write/s 3 r/s 24
  28. 28. http://danga.com/words/ Spreading Writes  Our database machines already did RAID  We did backups  So why put user data on 6+ slave machines? (~12+ disks) − overkill redundancy − wasting time writing everywhere! 25
  29. 29. http://danga.com/words/ Partition your data!  Spread your databases out, into “roles” − roles that you never need to join between  different users  or accept you'll have to join in app  Each user assigned to a numbered HA cluster  Each cluster has multiple machines − writes self-contained in cluster (writing to 2-3 machines, not 6) 26
  30. 30. http://danga.com/words/ User Clusters 27
  31. 31. http://danga.com/words/ User Clusters SELECT userid, clusterid FROM user WHERE user='bob' 27
  32. 32. http://danga.com/words/ User Clusters SELECT userid, clusterid FROM user WHERE user='bob' userid: 839 clusterid: 2 27
  33. 33. http://danga.com/words/ User Clusters SELECT userid, clusterid FROM user WHERE user='bob' userid: 839 clusterid: 2 SELECT .... FROM ... WHERE userid=839 ... 27
  34. 34. http://danga.com/words/ User Clusters SELECT userid, clusterid FROM user WHERE user='bob' userid: 839 clusterid: 2 SELECT .... FROM ... WHERE userid=839 ... OMG i like totally hate my parents they just dont understand me and i h8 the world omg lol rofl *! :^- ^^; add me as a friend!!! 27
  35. 35. http://danga.com/words/ Details  per-user numberspaces − don't use AUTO_INCREMENT − PRIMARY KEY (user_id, thing_id) − so:  Can move/upgrade users 1-at-a-time: − per-user “readonly” flag − per-user “schema_ver” property − user-moving harness  job server that coordinates, distributed long- lived user-mover clients who ask for tasks − balancing disk I/O, disk space 28
  36. 36. http://danga.com/words/ Shared Storage (SAN, SCSI, DRBD...)  Turn pair of InnoDB machines into a cluster − looks like 1 box to outside world. floating IP.  One machine at a time mounting fs, running MySQL  Heartbeat to move IP, {un,}mount filesystem, {stop,start} mysql  filesystem repairs,  innodb repairs,  don’t lose any committed transactions.  No special schema considerations  MySQL 4.1 w/ binlog sync/flush options − good − The cluster can be a master or slave as well 29
  37. 37. http://danga.com/words/ Shared Storage: DRBD  Linux block device driver − “Network RAID 1” − Shared storage without sharing! − sits atop another block device − syncs w/ another machine's block device  cross-over gigabit cable ideal. network is faster than random writes on your disks.  InnoDB on DRBD: HA MySQL! − can hang slaves off HA pair, − and/or, − HA pair can be slave of a master drbd sda ext3 mysql floater ip drbd sda ext3 mysql 30
  38. 38. http://danga.com/words/ MySQL Clustering Options: Pros & Cons  No magic bullet... − Master/Slave  doesn’t scale with writes − Master/Master  special schemas − DRBD  only HA, not LB − MySQL Cluster  special-purpose − ....  lots of options! − :) − :( 31
  39. 39. http://danga.com/words/ Part II Our Software 32
  40. 40. http://danga.com/words/ Caching  caching's key to performance − store result of a computation or I/O for quicker future access (classic space/time trade-off)  Where to cache? − mod_perl/php internal caching  memory waste (address space per apache child) − shared memory  limited to single machine, same with Java/C#/ Mono − MySQL query cache  flushed per update, small max size − HEAP tables  fixed length rows, small max size 33
  41. 41. http://danga.com/words/ memcached http://www.danga.com/memcached/  our Open Source, distributed caching system  implements a dictionary ADT, with network API  run instances wherever free memory  two-level hash − client hashes* to server, − server has internal dictionary (hash table)  no “master node”, nodes aren’t aware of each other  protocol simple, XML-free − clients: c, perl, java, c#, php, python, ruby, ...  popular, fast  scalable 34
  42. 42. http://danga.com/words/ Protocol Commands  set, add, replace  delete  incr, decr − atomic, returning new value 35
  43. 43. http://danga.com/words/ Picture 36
  44. 44. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB 36
  45. 45. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB 36
  46. 46. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB 0 1 2 3 36
  47. 47. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 36
  48. 48. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 $val = $client->get(“foo”) 36
  49. 49. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 36
  50. 50. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 connect to server[2] (“10.0.0.101:11211”) 36
  51. 51. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 connect to server[2] (“10.0.0.101:11211”) GET foo 36
  52. 52. http://danga.com/words/ Picture 10.0.0.100:11211 1GB 10.0.0.101:11211 2GB 10.0.0.102:11211 1GB Client 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 connect to server[2] (“10.0.0.101:11211”) GET foo (response) 36
  53. 53. http://danga.com/words/ Client hashing onto a memcacached node  Up to client how to pick a memcached node  Traditional way: − CRC32(<key>) % <num_servers> − (servers with more memory can own more slots) − CRC32 was least common denominator for all languages to implement, allowing cross-language memcached sharing − con: can’t add/remove servers without hit rate crashing  “Consistent hashing” − can add/remove servers with minimal <key> to <server> map changes 37
  54. 54. http://danga.com/words/ memcached internals  libevent − epoll, kqueue...  event-based, non-blocking design − optional multithreading, thread per CPU (not per client)  slab allocator  referenced counted objects − slow clients can’t block other clients from altering namespace or data  LRU  all internal operations O(1) 38
  55. 55. http://danga.com/words/ Perlbal 39
  56. 56. http://danga.com/words/ Web Load Balancing  BIG-IP, Alteon, Juniper, Foundry − good for L4 or minimal L7 − not tricky / fun enough. :-)  Tried a dozen reverse proxies − none did what we wanted or were fast enough  Wrote Perlbal − fast, smart, manageable HTTP web server / reverse proxy / LB − can do internal redirects  and dozen other tricks 40
  57. 57. http://danga.com/words/ Perlbal  Perl  parts optionally in C with plugins  single threaded, async event-based − uses epoll, kqueue, etc.  console / HTTP remote management − live config changes  handles dead nodes, smart balancing  multiple modes − static webserver − reverse proxy − plug-ins (Javascript message bus.....)  plug-ins − GIF/PNG altering, .... 41
  58. 58. http://danga.com/words/ Perlbal: Persistent Connections 42
  59. 59. http://danga.com/words/ Perlbal: Persistent Connections  perlbal to backends (mod_perls) − know exactly when a connection is ready for a new request  no complex load balancing logic: just use whatever's free. beats managing “weighted round robin” hell.  clients persistent; not tied to a specific backend connection 42
  60. 60. http://danga.com/words/ Perlbal: Persistent Connections  perlbal to backends (mod_perls) − know exactly when a connection is ready for a new request  no complex load balancing logic: just use whatever's free. beats managing “weighted round robin” hell.  clients persistent; not tied to a specific backend connection PB 42
  61. 61. http://danga.com/words/ Perlbal: Persistent Connections  perlbal to backends (mod_perls) − know exactly when a connection is ready for a new request  no complex load balancing logic: just use whatever's free. beats managing “weighted round robin” hell.  clients persistent; not tied to a specific backend connection PB Apache Apache Client Client 42
  62. 62. http://danga.com/words/ Perlbal: Persistent Connections  perlbal to backends (mod_perls) − know exactly when a connection is ready for a new request  no complex load balancing logic: just use whatever's free. beats managing “weighted round robin” hell.  clients persistent; not tied to a specific backend connection PB Apache Apache Client Client reqA1, B2 reqB1, A2 reqA1, A2 reqB1, B2 42
  63. 63. http://danga.com/words/ Perlbal: can verify new backend connections  connects to backends are often fast, but...  are you talking to the kernel’s listen queue?  or apache? (did apache accept() yet?)  send OPTIONs request to see if apache is there − Apache can reply to OPTIONS request quickly, − then Perlbal knows that conn is bound to an apache process, not waiting in a kernel queue  Huge improvement to user-visible latency!  (and more fair/even load balancing) #include <sys/socket.h> int listen(int sockfd, int backlog); 43
  64. 64. http://danga.com/words/ Perlbal: multiple queues  high, normal, low priority queues  paid users -> high queue  bots/spiders/suspect traffic -> low queue 44
  65. 65. http://danga.com/words/ Perlbal: cooperative large file serving  large file serving w/ mod_perl bad... − mod_perl has better things to do than spoon-feed clients bytes 45
  66. 66. http://danga.com/words/ Perlbal: cooperative large file serving  internal redirects − mod_perl can pass off serving a big file to Perlbal  either from disk, or from other URL(s) − client sees no HTTP redirect − “Friends-only” images  one, clean URL  mod_perl does auth, and is done.  perlbal serves. 46
  67. 67. http://danga.com/words/ Internal redirect picture 47
  68. 68. http://danga.com/words/ And the reverse...  Now Perlbal can buffer uploads as well.. − Problems:  LifeBlog uploading −cellphones are slow  LiveJournal/Friendster photo uploads −cable/DSL uploads still slow − decide to buffer to “disk” (tmpfs, likely)  on any of: rate, size, time  blast at backend, only when full request is in 48
  69. 69. http://danga.com/words/ Palette Altering GIF/PNGs  based on palette indexes, colors in URL, dynamically alter GIF/PNG palette table, then sendfile(2) the rest. 49
  70. 70. http://danga.com/words/ MogileFS 50
  71. 71. http://danga.com/words/ oMgFileS 51
  72. 72. http://danga.com/words/ MogileFS  our distributed file system  open source  userspace  based all around HTTP (NFS support now removed)  hardly unique − Google GFS − Nutch Distributed File System (NDFS)  production-quality − lot of users − lot of big installs 52
  73. 73. http://danga.com/words/ MogileFS: Why  alternatives at time were either: − closed, non-existent, expensive, in development, complicated, ... − scary/impossible when it came to data recovery  new/uncommon/ unstudied on-disk formats  because it was easy − initial version = 1 weekend! :) − current version = many, many weekends :) 53
  74. 74. http://danga.com/words/ MogileFS: Main Ideas  files belong to classes, which dictate: − replication policy, min replicas, ...  tracks what disks files are on − set disk's state (up, temp_down, dead) and host  keep replicas on devices on different hosts − (default class policy) − No RAID! − multiple tracker databases − all share same database cluster (MySQL, etc..)  big, cheap disks − dumb storage nodes w/ 12, 16 disks, no RAID 54
  75. 75. http://danga.com/words/ MogileFS components  clients  mogilefsd (does all real work)  database(s) (MySQL, .... abstract)  storage nodes 55
  76. 76. http://danga.com/words/ MogileFS: Clients  tiny text-based protocol  Libraries available for: − Perl  tied filehandles  MogileFS::Client − my $fh = $mogc->new_file(“key”, [[$class], ...]) − Java − PHP − Python? − porting to $LANG is be trivial − future: no custom protocol. only HTTP  clients don't do database access 56
  77. 77. http://danga.com/words/ MogileFS: Tracker (mogilefsd)  The Meat  event-based message bus  load balances client requests, world info  process manager − heartbeats/watchdog, respawner, ...  Child processes: − ~30x client interface (“query” process)  interfaces client protocol w/ db(s), etc − ~5x replicate − ~2x delete − ~1x fsck, reap, monitor, ..., ... 57
  78. 78. http://danga.com/words/ Trackers' Database(s)  Abstract as of Mogile 2.x − MySQL − SQLite (joke/demo) − Pg/Oracle coming soon? − Also future:  wrapper driver, partitioning any above − small metadata in one driver (MySQL Cluster?), − large tables partitioned over 2-node HA pairs  Recommend config: − 2xMySQL InnoDB on DRBD − 2 slaves underneath HA VIP  1 for backups  read-only slave for during master failover window 58
  79. 79. http://danga.com/words/ MogileFS storage nodes (mogstored)  HTTP transport − GET − PUT − DELETE  mogstored listens on 2 ports...  HTTP. --server={perlbal,lighttpd,...}  configs/manages your webserver of choice.  perlbal is default. some people like apache, etc − management/status:  iostat interface, AIO control, multi-stat() (for faster fsck)  files on filesystem, not DB − sendfile()! future: splice() − filesystem can be any filesystem 59
  80. 80. http://danga.com/words/ Large file GET request 60
  81. 81. http://danga.com/words/ Large file GET request Auth: complex, but quick 60
  82. 82. http://danga.com/words/ Large file GET request Auth: complex, but quick Spoonfeeding: slow, but event- based 60
  83. 83. http://danga.com/words/ Gearman 61
  84. 84. http://danga.com/words/ manaGer 62
  85. 85. http://danga.com/words/ Manager dispatches work, but doesn't do anything useful itself. :) 63
  86. 86. http://danga.com/words/ Gearman  system to load balance function calls...  scatter/gather bunch of calls in parallel,  different languages,  db connection pooling,  spread CPU usage around your network,  keep heavy libraries out of caller code,  ...  ... 64
  87. 87. http://danga.com/words/ Gearman Pieces  gearmand − the function call router − event-loop (epoll, kqueue, etc)  workers. − Gearman::Worker – perl/ruby − register/heartbeat/grab jobs  clients − Gearman::Client[::Async] -- perl − also Ruby Gearman::Client − submit jobs to gearmand − opaque (to server) “funcname” string − optional opaque (to server) “args” string − opt coallescing key 65
  88. 88. http://danga.com/words/ Gearman Picture 66
  89. 89. http://danga.com/words/ Gearman Picture gearmand gearmand gearmand 66
  90. 90. http://danga.com/words/ Gearman Picture Worker Worker gearmand gearmand gearmand 66
  91. 91. http://danga.com/words/ Gearman Picture Worker Worker gearmand gearmand gearmand can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) 66
  92. 92. http://danga.com/words/ Gearman Picture Client Worker Worker gearmand gearmand gearmand can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) 66
  93. 93. http://danga.com/words/ Gearman Picture Client Worker Worker gearmand gearmand gearmand call(“funcA”) can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) 66
  94. 94. http://danga.com/words/ Gearman Picture Client Client Worker Worker gearmand gearmand gearmand call(“funcA”) can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) 66
  95. 95. http://danga.com/words/ Gearman Picture Client Client Worker Worker gearmand gearmand gearmand call(“funcA”) call(“funcB”) can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) 66
  96. 96. http://danga.com/words/ Gearman Protocol  efficient binary protocol  No XML  but also line-based text protocol for admin commands −telnet to gearmand and get status −useful for Nagios plugins, etc 67
  97. 97. http://danga.com/words/ Gearman Uses  Image::Magick outside of your mod_perls!  DBI connection pooling (DBD::Gofer + Gearman)  reducing load, improving visibility  “services” − can all be in different languages, too! 68
  98. 98. http://danga.com/words/ Gearman Uses, cont..  running code in parallel − query ten databases at once  running blocking code from event loops − DBI from POE/Danga::Socket apps  spreading CPU from ev loop daemons  calling between different languages,  ... 69
  99. 99. http://danga.com/words/ Gearman Misc  Guarantees: − none! hah! :)  please wait for your results.  if client goes away, no promises − all retries on failures are done by client  but server will notify client(s) if working worker goes away.  No policy/conventions in gearmand − all policy/meaning between clients <-> workers  ... 70
  100. 100. http://danga.com/words/ Sick Gearman Demo  Don’t actually use it like this... but: use strict; use DMap qw(dmap); DMap->set_job_servers("sammy", "papag"); my @foo = dmap { "$_ = " . `hostname` } (1..10); print "dmap says:n @foo"; $ ./dmap.pl dmap says: 1 = sammy 2 = papag 3 = sammy 4 = papag 5 = sammy 6 = papag 7 = sammy 8 = papag 9 = sammy 10 = papag 71
  101. 101. http://danga.com/words/ Gearman Summary  Gearman is sexy. − especially the coalescing  Check it out! − it's kinda our little unadvertised secret  oh crap, did I leak the secret? 72
  102. 102. http://danga.com/words/ TheSchwartz 73
  103. 103. http://danga.com/words/ TheSchwartz  Like gearman: − job queuing system − opaque function name − opaque “args” blob − clients are either:  submitting jobs  workers  But unlike gearman: − Reliable job queueing system − not low latency − fire & forget (as opposed to gearman, where you wait for result)  currently library, not network service 74
  104. 104. http://danga.com/words/ TheSchwartz Primitives  insert job  “grab” job (atomic grab) − for 'n' seconds.  mark job done  temp fail job for future − optional notes, rescheduling details..  replace job with 1+ other jobs − atomic.  ... 75
  105. 105. http://danga.com/words/ TheSchwartz  backing store: − a database − uses Data::ObjectDriver  MySQL,  Postgres,  SQLite,  ....  but HA: you tell it @dbs, and it finds one to insert job into − likewise, workers foreach (@dbs) to do work 76
  106. 106. http://danga.com/words/ TheSchwartz uses  outgoing email (SMTP client) − millions of emails per day − TheSchwartz::Worker::SendEmail − Email::Send::TheSchwartz  LJ notifications − ESN: event, subscription, notification  one event (new post, etc) -> thousands of emails, SMSes, XMPP messages, etc...  pinging external services  atomstream injection  .....  dozens of users  shared farm for TypePad, Vox, LJ 77
  107. 107. http://danga.com/words/ gearmand + TheSchwartz  gearmand: not reliable, low-latency, no disks  TheSchwartz: latency, reliable, disks  In TypePad: − TheSchwartz, with gearman to fire off TheSchwartz workers.  disks, but low-latency  future: no disks, SSD/Flash, MySQL Cluster 78
  108. 108. http://danga.com/words/ djabberd 79
  109. 109. http://danga.com/words/ djabberd  Our Jabber/XMPP server  powers our “LJ Talk” service  S2S: works with GoogleTalk, etc  perl, event-based (epoll, etc)  done 300,000+ conns  tiny per-conn memory overhead − release XML parser state if possible 80
  110. 110. http://danga.com/words/ djabberd hooks  everything is a hook − not just auth! like, everything. − auth, − roster, − vcard info (avatars), − presence, − delivery, − inter-node cluster delivery, − ala mod_perl, qpsmtpd, etc.  async hooks − hooks phases can take as long as they want before they answer, or decline to next phase in hook chain... − we use Gearman::Client::Async 81
  111. 111. http://danga.com/words/ Thank you! Questions to: brad@danga.com Software: http://danga.com/ http://code.sixapart.com/ 82
  112. 112. http://danga.com/words/ Questions? User DB Cluster 1 uc1a uc1b User DB Cluster 2 uc2a uc2b User DB Cluster 3 uc3a uc3b User DB Cluster N ucNa ucNb Job Queues (xN) jqNa jqNb Memcached mc4 mc3 mc2 mcN ... mc1 mod_perl web4 web3 web2 webN ... web1 BIG-IP bigip2 bigip1 perlbal (httpd/proxy) proxy4 proxy3 proxy2 proxy5 proxy1 Global Database slave1 master_a master_b slave2 ... slave5 MogileFS Database mog_a mog_b Mogile Trackers tracker3tracker1 Mogile Storage Nodes ... sto2 sto8 sto1 net. djabberd djabberd djabberd gearmand gearmand1 gearmandN “workers” gearwrkN theschwkN slave1 slaveN 83
  113. 113. http://danga.com/words/ Bonus Slides  if extra time 84
  114. 114. http://danga.com/words/ Data Integrity  Databases depend on fsync() − but databases can't send raw SCSI/ATA commands to flush controller caches, etc  fsync() almost never works work − Linux, FS' (lack of) barriers, raid cards, controllers, disks, ....  Solution: test! & fix − disk-checker.pl  client/server  spew writes/fsyncs, record intentions on alive machine, yank power, checks. 85
  115. 115. http://danga.com/words/ Persistent Connection Woes  connections == threads == memory − My pet peeve:  want connection/thread distinction in MySQL!  w/ max-runnable-threads tunable  max threads − limit max memory/concurrency  DBD::Gofer + Gearman − Ask  Data::ObjectDriver + Gearman 86
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×