Large platform architecture in (mostly) perl - an illustrated tour
         Tomas (t0m) Doran
         São Paulo.pm perl workshop 2010
         YAPC::EU Pisa 2010
This talk

• Is mostly a ramble
• About what I do for a living
• Good bits
• and bad bits (probably mostly bad bits)
• And when I say ‘illustrated’, I’m not very
  good at diagrams, sorry...
Making money from independent music

• IMPOSSIBLE
• No, no it isn’t. But we’re very lucky to have
  people who know the music industry
• A startup would tank
• Last.fm guys “keep losing less money”
The state51 conspiracy
Consolidated Independent
 Media Service Provider
 • Several (largely profitable) businesses based
   on the same technology platform
 • East London (Brick Lane), a warehouse.
 • Over 60% of UK independent content goes
   through us somewhere
Being S3 on the cheap
• WAV files are big. Videos are bigger.
• Transcodes aren’t small, especially when
  you have 15 of them.
• My music collection is several hundred
  terabytes
• Need to be able to serve this stuff fast and
  concurrently.
MogileFS

• Is free.
• Runs on cheap hardware
• Cheaper than S3.
• Not so awesome if you aren’t LiveJournal
Data center design
• 8 amp racks. Seriously, WTF!?!?!
• Electricity is more expensive than servers,
  ergo rolling hardware upgrades trivially pay
  for themselves.
• Transit is really, really expensive.
• Worth buying fiber to other locations to
  peer if you need lots of bandwidth.
Platform overview
  Three web hosts, each running the same stack:
    <VIP>
    Varnish (ESI)
    Apache                  | nginx + mogile + custom
    FCGI apps               | FCGI file auth
  Also: Encoding (bare metal), Encoding (VMWare), Encoding SOAP service,
        Memcached, Mogile Tracker
  MySQL object store: Master -> (replication) -> Slave
  Storage: many storage boxes
Web architecture
• App servers: Apache, apps under FastCGI, port 81
• Varnish + ESI for caching, port 80
• One Varnish per host, talking to all the Apaches
• One VIP per host
• Host failure: the VIP transfers to another host
• Apache/app failure (or overload): Varnish
  rebalances/retries.
Web architecture (cont)

• Varnish doesn’t cache media; it just provides
  failover.
• nginx passes the request to the FastCGI app.
• The app returns an X-Accel-Redirect header.
• nginx talks to MogileFS and handles delivery.
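A minimal sketch of the app's side of this handoff. It is written as a Python WSGI app for brevity, whereas the real apps are Perl under FastCGI. The app only decides *whether* and *what* to serve. It then returns an empty body with an `X-Accel-Redirect` header, and nginx performs an internal redirect to that URI. The `/mogile/` internal location and the `MEDIA_KEYS` lookup table are hypothetical. The custom nginx MogileFS handling behind that location is not shown.

```python
# Hypothetical mapping from public URL to MogileFS key; in a real system this
# would be a database lookup plus an authorization check.
MEDIA_KEYS = {"/download/track/42.mp3": "track-42-mp3"}

def download_app(environ, start_response):
    """Authorize a download, then let nginx stream the bytes.

    The app never reads the file. It points nginx at an internal location
    (assumed here to be /mogile/<key>) via X-Accel-Redirect, so no app
    process is tied up pushing bytes.
    """
    key = MEDIA_KEYS.get(environ.get("PATH_INFO", ""))
    if key is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    start_response("200 OK", [
        ("Content-Type", "audio/mpeg"),
        ("X-Accel-Redirect", "/mogile/" + key),  # nginx takes over from here
    ])
    return [b""]
```

On the nginx side, the target location would be marked `internal` so that clients cannot request it directly.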
Storage architecture
• Lots of boxes with lots of disk.
• Storage boxes take on many additional roles
  (MogileFS tracker, memcached node, bare-metal
  encoding, VMWare, SOAP service).
• Not all the boxes do all the roles.
• All the roles can safely fall over and die.
• Which is good, as they do. Or the box falls
  over. Or a, then b.
WAV files

• WAV is a container format.
• Loosely defined.
• You can stuff XML documents in WAV files
• Some encoders (oh hai, flac) are very picky.
• ‘dirty’ and ‘clean’ WAV files.
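Because WAV is a RIFF container, a file is a header followed by a list of tagged chunks. Embedded XML or other metadata is just an extra chunk. The Python sketch below walks the chunks and rebuilds a file that keeps only the `fmt ` and `data` chunks. Treating "clean" as "only those two chunks" is an assumption made for illustration, not necessarily the platform's actual rule.

```python
import struct

def iter_chunks(blob: bytes):
    """Yield (chunk_id, body_offset, declared_size) for each RIFF/WAVE chunk."""
    if len(blob) < 12 or blob[:4] != b"RIFF" or blob[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(blob):
        cid, size = struct.unpack_from("<4sI", blob, pos)
        yield cid, pos + 8, size
        pos += 8 + size + (size & 1)  # chunk bodies are padded to even length

def clean_wav(blob: bytes, keep=(b"fmt ", b"data")) -> bytes:
    """Rebuild the file with only the chunks in `keep` (assumed 'clean' form)."""
    out = bytearray()
    for cid, start, size in iter_chunks(blob):
        if cid in keep:
            body = blob[start:start + size]  # may be short if the file is truncated
            out += struct.pack("<4sI", cid, len(body)) + body
            if len(body) % 2:
                out += b"\x00"               # re-apply the pad byte
    # RIFF size field counts everything after itself: "WAVE" plus the chunks.
    return b"RIFF" + struct.pack("<I", 4 + len(out)) + b"WAVE" + bytes(out)
```

Writing the real body length, rather than copying the declared size, also repairs files whose `data` chunk is truncated. Picky encoders tend to reject exactly that kind of mismatch.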
Transcoding everything


• Lots of different formats
• WMA - GNARGGH%$@*&!!
Win32

• We’re running ActiveState for hysterical
  raisins.
• No XS modules
• Keep it as thin as possible
Encoding
  Components: HTTP Nodes (GET & PUT), Encoding Service (SOAP),
  Uploading Service (media), Encoder with Downloader, Local Disk
  and Uploader; per-format encoders (mp3, wma); runs on Win32 & Unix.
Snakes On A Plane

• SOAP actually works OK here, as we
  control both ends.
• Old version of SOAP::Lite
• Wouldn’t recommend it for interoperating
  with anything else
Logging
• Used to be terribly hard to debug
• Push logs into syslog
• Aggregate in Splunk, time-correlated across
  encoding machines, web service machines,
  etc.
• Much easier to work out what happened.
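A minimal Python sketch of the "push logs into syslog" half, using the standard library's `SysLogHandler`. Each record is tagged with the host name and a role, so that lines from encoders and web service machines can be told apart once Splunk has aggregated them. The `platform.<role>` logger names and the `host=` tag are illustrative assumptions. Time correlation also assumes the machines' clocks are kept in sync, for example with NTP.

```python
import logging
import logging.handlers
import socket

def syslog_logger(role: str, address=("localhost", 514)) -> logging.Logger:
    """Return a logger that ships records to syslog, tagged with host and role.

    `role` (e.g. "encoder", "webservice") is a hypothetical tag. Call once per
    role: each call adds another handler to the same named logger.
    """
    logger = logging.getLogger("platform." + role)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=address)  # UDP by default
    host = socket.gethostname().replace("%", "%%")  # escape for the % formatter
    handler.setFormatter(logging.Formatter(
        "%(name)s[%(process)d]: host=" + host + " %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```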
Hardware is shit

• When you have several hundred TB, the
  undetected bit error rate of magnetic media
  is actually significant.
• See also networks, memory, etc.
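A common defence against silent corruption is to record a checksum when a file is written and periodically re-read ("scrub") every copy to compare against it. The Python sketch below shows that idea. How the expected checksums are stored, and what happens on a mismatch, are assumptions. For example, a damaged replica could be replaced from another copy that still verifies.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large WAVs don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def scrub(expected: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose contents no longer match the checksum
    recorded at write time. Missing files count as corrupt too."""
    bad = []
    for rel, digest in expected.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != digest:
            bad.append(rel)
    return bad
```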
Things will always fail

• If you need reliability, you have to design it
  in from the start.
• Not only will you have (a lot of) hardware
  failures, all the software will break in
  unexpected ways. Let’s not talk about
  networks...
• Maybe you don’t need this..
Queueing

• We have work queues of different types of
  media (e.g. mp3/wma/aac etc)
• In the database.
• Don’t do this.
MySQL sucks

• Only one type of JOIN (nested loop)
• No query rewriting
• Not enough stats for the planner to be
  sane
This can hurt
• File Transform table:
 • Master (File)
 • Result (File)
 • Status (pending/complete/failed/running)
 • TransformStep (from/to)
• Leads to bad join order, massive fail
MySQL sucks

       FAIL
How to fail
• SELECT all file transforms that lead to wma
  (millions).
• JOIN all files, ever (millions). Filter to find
  those in state ‘pending’
• To the planner, starting from ‘all pending’
  looks like a bad bet: the cardinality of
  ‘all wmas’ looks better than the cardinality
  of ‘all pending’.
• JOIN in the wrong order, nested loop,
  screwed.
Queueing
• Did I mention queues in the DB suck?
• Even if you’re not screwing it up.
• Get a Message Queue (or at least an async
  job server)
• If your problem is simple, use Gearman.
  If it’s harder or you need interop, RabbitMQ.
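The Python sketch below shows the shape of the alternative to a `pending` table. There is one queue per output format (mp3/wma/aac), and workers block waiting for messages instead of repeatedly querying the database. In-process `queue.Queue` objects stand in for broker queues such as RabbitMQ queues or Gearman functions. The "transcode" is faked. A real broker adds what this sketch lacks: persistence, acknowledgements, and workers on other machines.

```python
import queue
import threading

# One queue per output format, standing in for broker queues (names illustrative).
QUEUES = {fmt: queue.Queue() for fmt in ("mp3", "wma", "aac")}

def worker(fmt: str, results: queue.Queue) -> None:
    q = QUEUES[fmt]
    while True:
        job = q.get()            # blocks until work arrives; no polling
        if job is None:          # sentinel: shut down
            q.task_done()
            return
        # A real worker would transcode job["file_id"] to `fmt` here.
        results.put((fmt, job["file_id"], "complete"))
        q.task_done()

def run(jobs):
    """Start one worker per format, dispatch jobs, and collect the results."""
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(f, results)) for f in QUEUES]
    for t in threads:
        t.start()
    for job in jobs:
        QUEUES[job["format"]].put(job)
    for q in QUEUES.values():
        q.put(None)
    for t in threads:
        t.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```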
Mutable state

• Mutable state is the enemy
• Too many things are read-write.
• No idea how an object got to this state
Anemic domain model
  Object-oriented programming (OOP) is a programming paradigm that
  uses "objects" – data structures consisting of data fields and
  methods together with their interactions – to design applications
  and computer programs. Programming techniques may include features
  such as data abstraction, encapsulation, modularity, polymorphism,
  and inheritance.
Anemic domain model
• Superset of too much mutable state
• Able to create invalid objects
• Able to make previously valid objects
  invalid
• Violates the encapsulation and
  information hiding principles.
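The opposite of an anemic model is one where the object owns its invariants. State changes only through named operations, and illegal transitions are refused. The Python sketch below applies this to the file transform fields listed earlier (master, result, status, step). The transition table, including "failed" jobs being retryable, is an assumption for illustration. Python's leading-underscore attributes are a convention rather than enforcement, so this sketch shows the design, not a hard guarantee.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

# Assumed legal transitions; anything else is a bug, not a new state.
_ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETE, Status.FAILED},
    Status.FAILED: {Status.PENDING},  # retry
    Status.COMPLETE: set(),
}

class FileTransform:
    """A transform whose state can only change through named operations."""

    def __init__(self, master_id: int, step: str):
        self._master_id = master_id
        self._step = step              # e.g. "wav->mp3"
        self._status = Status.PENDING
        self._result_id = None
        self._history = [Status.PENDING]

    @property
    def status(self) -> Status:
        return self._status

    @property
    def result_id(self):
        return self._result_id

    @property
    def history(self) -> tuple:
        return tuple(self._history)    # read-only view: you know how it got here

    def _move(self, new: Status) -> None:
        if new not in _ALLOWED[self._status]:
            raise ValueError(f"illegal transition {self._status.value} -> {new.value}")
        self._status = new
        self._history.append(new)

    def start(self) -> None:
        self._move(Status.RUNNING)

    def fail(self) -> None:
        self._move(Status.FAILED)

    def retry(self) -> None:
        self._move(Status.PENDING)

    def complete(self, result_id: int) -> None:
        # A complete transform must have a result file.
        if result_id is None:
            raise ValueError("complete transform needs a result file")
        self._move(Status.COMPLETE)
        self._result_id = result_id
```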
scripts

• Lots of our business logic was in scripts
  that manipulated objects
• You need people to run scripts (in screen
  sessions)
• Ewwww, ewwwww.
Jobs
• Moved to a job based approach
• Jobs started by file creation, or changing
  state of something in a web app
• Jobs sent via message queueing
• Results go via message queueing
• Jobs trigger other jobs
Jobs Example
• Validate XLS file supplied with order.
• Valid files trigger another job to create
  objects for each thing in the XLS
• This then triggers another job to create
  transforms, which are then done...
• ... etc ...
• Can’t do this workflow in a web request.
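The Python sketch below shows the "jobs trigger other jobs" shape of the XLS example. Each handler does one step and returns the follow-up jobs it wants sent. A `deque` stands in for the message queue. The job names, the "every row needs a track id" validation rule, and the fixed mp3/wma formats are hypothetical. Spreadsheet parsing is not shown.

```python
from collections import deque

def validate_xls(job):
    # Hypothetical rule: every row must carry a track id.
    if all("track" in row for row in job["rows"]):
        return [{"type": "create_objects", "rows": job["rows"]}]
    return []  # invalid: the chain stops (a real system would report it)

def create_objects(job):
    return [{"type": "create_transforms", "track": row["track"]} for row in job["rows"]]

def create_transforms(job):
    return [{"type": "transform", "track": job["track"], "to": fmt} for fmt in ("mp3", "wma")]

def transform(job):
    return []  # leaf job: the actual encoding would happen here

HANDLERS = {f.__name__: f for f in (validate_xls, create_objects, create_transforms, transform)}

def run_chain(first_job):
    """Drain the queue, letting each job enqueue its successors."""
    pending = deque([first_job])  # stands in for the message queue
    ran = []
    while pending:
        job = pending.popleft()
        ran.append(job["type"])
        pending.extend(HANDLERS[job["type"]](job))
    return ran
```

Because each step is a separate message, a slow or failed step never holds a web request open. It just leaves work sitting on the queue.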
Jobs Future

• More automation of things people run
  scripts for.
• Automatic job regeneration (you will lose
  messages).
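Automatic regeneration usually means treating the database as the source of truth. A periodic sweep finds objects whose state says work should be happening but which have not progressed for a while, and re-sends their jobs. The Python sketch below assumes jobs are idempotent, so a duplicate is harmless while a lost message is not. The one-hour staleness threshold and the job message format are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Transform:
    id: int
    status: str        # "pending", "running", "complete", "failed"
    updated_at: float  # seconds since epoch

def regenerate(transforms, enqueue, now, stale_after=3600.0):
    """Re-send jobs for transforms that look stuck; return the ids re-sent.

    Assumes the transform job is idempotent, so re-sending one that was merely
    slow (rather than lost) does no harm.
    """
    resent = []
    for t in transforms:
        if t.status in ("pending", "running") and now - t.updated_at > stale_after:
            enqueue({"type": "transform", "id": t.id})
            resent.append(t.id)
    return resent
```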
Lava flow

• Old (possibly unclean/invalid) data
• Old (unused/unmaintained) code
• “What harm does it do”
Relational integrity

• Seems to be a pipe dream more often than
  not in the real world.
• Why?
• It’s not hard
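The point is that the database can enforce integrity for you, given a few declarations. The Python sketch below uses SQLite so that it is self-contained. MySQL's InnoDB engine enforces `FOREIGN KEY` constraints as well, although MyISAM parses and ignores them. The table layout loosely follows the file transform table described earlier. The exact column names are hypothetical.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    db.executescript("""
        CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
        CREATE TABLE file_transform (
            id        INTEGER PRIMARY KEY,
            master_id INTEGER NOT NULL REFERENCES file(id),
            result_id INTEGER REFERENCES file(id),
            status    TEXT NOT NULL
                      CHECK (status IN ('pending', 'running', 'complete', 'failed'))
        );
    """)
    return db
```

With this in place, a transform pointing at a nonexistent file, a made-up status, or deleting a file that transforms still reference are all rejected with `sqlite3.IntegrityError` instead of becoming bad data.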
Data consistency


• This should theoretically be the same thing
  as relational integrity.
• In practice...
Mumble View Crap

• Too much logic in templates
• Copy & paste
• Business objects viewed as unchangeable
• Deleted 3000 lines from 2 simple
  workflows. This fixed a dozen bugs.
Tangram
• No LEFT JOIN
• Displaying a product list becomes an N+1
  queries problem (one extra query per product).
• OUCH
• Keep it stupid: put the entire DB hot in
  memcache!
Don’t do web design

• You are a programmer
• Make people pay for a design/CSS/HTML
  person
• Work with them
• Be happy
Love your sysadmins
• Help them out.
• Build packages, or local::libs or something
• Keep everything in revision control
• Allow things to be sensibly configured.
• DOCUMENT THE POSSIBLE SETTINGS
• Use systems management - Puppet?
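One way to "document the possible settings" is to declare every setting once, with its default and a description, and generate both the config lookup and the documentation from that list. The Python sketch below assumes that settings come from environment variables. The setting names are hypothetical, although 7001 is the usual MogileFS tracker port.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Setting:
    name: str
    default: str
    doc: str

# Hypothetical settings, each declared exactly once with its documentation.
SETTINGS = [
    Setting("MOGILE_TRACKERS", "127.0.0.1:7001", "Comma-separated MogileFS tracker host:port list."),
    Setting("MEMCACHE_SERVERS", "cache0:11210", "Comma-separated memcached host:port list."),
    Setting("LOG_SYSLOG_HOST", "localhost", "Host that receives syslog output."),
]

def load(env=os.environ) -> dict:
    """Resolve every declared setting from `env`, falling back to its default."""
    return {s.name: env.get(s.name, s.default) for s in SETTINGS}

def describe() -> str:
    """Human-readable list of all settings, e.g. for `--help` or a README."""
    return "\n".join(f"{s.name} (default: {s.default})\n    {s.doc}" for s in SETTINGS)
```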
Love your logs

• Active feedback
• Aggregate in Splunk
• Actively prune useless stuff
• Actively add useful stuff after a production
  incident
ESI

• Is really awesome
• Makes the pain go away
• PURGE requests
• Keep everything hot all the time
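With one Varnish per host, an invalidation has to reach every cache. The Python sketch below sends an HTTP `PURGE` request with the standard library's `http.client`. It assumes each Varnish's VCL is written to accept PURGE, usually restricted by an ACL. That configuration is not shown, and Varnish does not honour PURGE without it.

```python
import http.client

def purge(host: str, path: str, port: int = 80, timeout: float = 5.0) -> int:
    """Send PURGE for `path` to one Varnish and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("PURGE", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status
    finally:
        conn.close()

def purge_everywhere(hosts, path, port=80) -> dict:
    """Purge the same path from every host's Varnish; map host -> status."""
    return {h: purge(h, path, port) for h in hosts}
```

After a purge, the next request repopulates the cache, which is how the pages stay hot without serving stale content.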
memcache everything

• Keep the entire database hot in memcache
• We mostly ask trivial questions, so just
  cache those paths.
• 30 GB of RAM isn’t actually much (3
  boxes...)
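"Just cache those paths" is the cache-aside pattern. The app checks memcached first. On a miss, it asks the database and stores the answer. The Python sketch below uses a dict-backed stand-in with memcached-style `get`/`set`/`delete` so that it runs without a server. The key naming (`product:42`) and the JSON encoding are assumptions.

```python
import json

class DictCache:
    """Stand-in for a memcached client (get/set/delete only; expiry ignored)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, expire=0):
        self._data[key] = value
    def delete(self, key):
        self._data.pop(key, None)

def cached(cache, key, load, ttl=3600):
    """Cache-aside: return the cached value, or call `load()` and cache it."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = load()
    # Storing JSON means even a None result is cached (as "null").
    cache.set(key, json.dumps(value), ttl)
    return value
```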
memcache
• IS A CACHE
• Use sequential port numbers and CNAMEs
• E.g. cache0:11210, cache1:11211,
  cache2:11212 etc..
• Run several per machine
• Allows you to scale capacity and rebalance
  without entire cache flush.
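The naming scheme separates *which instance a key belongs to* from *which machine that instance runs on*. Clients hash keys over a fixed list of logical instances (`cacheN:1121N`). Each `cacheN` is a DNS CNAME, so moving an instance to another box only changes DNS. Only that instance's keys go cold. The Python sketch below illustrates this with simple CRC32-modulo hashing. The instance count and host names are made up. Growing the instance list itself would still remap keys under modulo hashing, which is a reason to run several small instances per machine from the start.

```python
import zlib

# Six logical instances, named per the slide's scheme: cache0:11210 ... cache5:11215.
INSTANCES = [f"cache{i}:{11210 + i}" for i in range(6)]

def instance_for(key: str, instances=INSTANCES) -> str:
    """Pick the logical instance for a key. CRC32 is stable across processes,
    unlike Python's built-in hash()."""
    return instances[zlib.crc32(key.encode()) % len(instances)]

def placement(cnames: dict) -> dict:
    """Group logical instances by the physical host their CNAME points at."""
    by_host = {}
    for inst in INSTANCES:
        by_host.setdefault(cnames[inst.split(":")[0]], []).append(inst)
    return by_host
```

Usage: repointing `cache5` from `boxB` to a new `boxC` leaves `instance_for` unchanged for every key.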
Don’t push bytes

• X-Sendfile and X-Accel-Redirect
• I already talked about file delivery like this
• Using 100 MB of RAM to proxy web
  requests does not scale.
Test everything

• Redundant systems need testing
• You’ll still die unexpectedly in production
• If you can manage it, make deployment
  somebody else’s problem (SEP).
• Thanks for listening
• Questions?
