Large platform architecture in (mostly) perl - an illustrated tour

Tomas (t0m) Doran
São Paulo.pm perl workshop 2010
YAPC::EU Pisa 2010

This talk

• Is mostly a ramble
• About what I do for a living
• Good bits
• and bad bits (probably mostly bad bits)
• And when I say ‘illustrated’, I’m not very
good at diagrams, sorry...

Making money from
independent music

• IMPOSSIBLE
• No, no it isn’t. But we’re very lucky to have
people who know the music industry
• A startup would tank
• Last.fm guys “keep losing less money”

The state51 conspiracy
Consolidated Independent
Media Service Provider
• Several (largely profitable) businesses based
on the same technology platform
• East London (Brick Lane), a warehouse.
• > 60% of UK independent content goes
through us somewhere

Being S3 on the cheap
• WAV files are big. Videos are bigger.
• Transcodes aren’t small, especially when
you have 15 of them.
• My music collection is several hundred
terabytes
• Need to be able to serve this stuff fast and
concurrently.

MogileFS

• Is free.
• Runs on cheap hardware
• Cheaper than S3.
• Not so awesome if you aren’t Livejournal

Data center design
• 8 amp racks. Seriously, WTF!?!?!
• Electricity is more expensive than servers,
ergo rolling hardware upgrades trivially pay
for themselves.
• Transit is really, really expensive.
• Worth buying fiber to other locations to
peer if you need lots of bandwidth.

Platform overview
[Diagram: three identical web hosts. Each stacks, top to bottom: VIP; Varnish (ESI); Apache and nginx (+ mogile + custom); FCGI apps and FCGI file auth.]
[Also: Encoding (bare metal), Encoding (VMWare), Encoding SOAP service, Memcached, Mogile Tracker.]
[Storage: MySQL object store master replicating to a MySQL object store slave, plus many Storage nodes.]

Web architecture
• App servers: Apache, with apps under FastCGI, on port 81
• Varnish + ESI for caching, on port 80
• 1 Varnish per host, talks to all the Apaches
• 1 VIP per host
• Host fail: VIP transfer
• Apache/app fail (or overload): Varnish
rebalances/retries.

Web architecture (cont)

• Varnish doesn’t cache media, just provides
failover.
• nginx sends the hit to FastCGI app.
• Returns X-Accel-Redirect.
• nginx talks to MogileFS, handles delivery.


Storage architecture
• Lots of boxes with lots of disk.
• Many additional roles to storage. (Mogile
tracker, memcache node, metal encoding,
VMWare, SOAP Service)
• Not all the boxes do all the roles.
• All the roles can safely fall over and die.
• Which is good, as they do. Or the box falls
over. Or a, then b.

WAV files

• WAV is a container format.
• Loosely defined.
• You can stuff XML documents in WAV files
• Some encoders (oh hai flac) very picky.
• ‘dirty’ and ‘clean’ WAV files.
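To make "container format" concrete, here is a small sketch that walks the RIFF chunks of a WAV file and reports anything beyond `fmt ` and `data`. Treating a file with extra chunks as 'dirty' is an assumption made for this example; the talk does not define the terms. `build_wav` is a helper that exists only to create test input.

```python
import io
import struct

# Chunk IDs a plain PCM WAV needs; anything else counts as "extra" here.
EXPECTED_CHUNKS = {b"fmt ", b"data"}


def list_chunks(stream):
    """Return [(chunk_id, size), ...] for a RIFF/WAVE stream."""
    header = stream.read(12)
    if len(header) < 12:
        raise ValueError("too short to be a WAV file")
    riff, _riff_size, wave = struct.unpack("<4sI4s", header)
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    while True:
        head = stream.read(8)
        if len(head) < 8:
            break  # end of file (or a truncated trailing chunk header)
        chunk_id, size = struct.unpack("<4sI", head)
        chunks.append((chunk_id, size))
        # Chunk bodies are padded to an even number of bytes.
        stream.seek(size + (size & 1), io.SEEK_CUR)
    return chunks


def extra_chunks(stream):
    """Chunk IDs beyond 'fmt ' and 'data' (this sketch's notion of 'dirty')."""
    return [cid for cid, _ in list_chunks(stream) if cid not in EXPECTED_CHUNKS]


def build_wav(chunks):
    """Build a tiny in-memory RIFF/WAVE blob from (id, payload) pairs."""
    body = b"WAVE"
    for cid, payload in chunks:
        body += cid + struct.pack("<I", len(payload)) + payload
        if len(payload) & 1:
            body += b"\x00"  # pad byte for odd-sized payloads
    return io.BytesIO(b"RIFF" + struct.pack("<I", len(body)) + body)
```

A cleaning step could then rewrite the file with only the chunks a picky encoder expects, though which chunks are safe to drop depends on the encoder.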

Transcoding everything

• Lots of different formats
• WMA - GNARGGH%$@*&!!

Win32

• We’re running ActiveState for hysterical
raisins.
• No XS modules
• Thin as possible

Encoding
[Diagram. Components shown: HTTP nodes, Encoding Service, Uploading Service, and an Encoder host with Downloader, Uploader, local disk, and Win32 and Unix encoders (mp3, wma). Links are labelled GET & PUT (media) and SOAP.]

Snakes On A Plane

• SOAP actually works ok here, as we
control both ends.
• Old version of SOAP::Lite
• Wouldn’t recommend interoperating

Logging
• Used to be terribly hard to debug
• Push logs into syslog
• Aggregate in splunk - time correlated from
encoding machines, web service machines,
etc.
• Much easier to work out what happened.

Hardware is shit

• When you have several hundred TB, the undetected
bit error rate of magnetic media is actually
significant.
• See also networks, memory, etc.
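One common defence against silent corruption is to record a checksum when a file is ingested and re-verify it later. The sketch below assumes a simple path-to-checksum manifest; the manifest shape and the function names are illustrative and not taken from the platform described here.

```python
import hashlib


def sha256_of(path, block_size=1 << 20):
    """Hash a file in blocks so large media never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_corrupt(manifest):
    """Return the paths whose current checksum differs from the recorded one.

    manifest maps path -> checksum recorded at ingest time.
    """
    return sorted(
        path for path, expected in manifest.items() if sha256_of(path) != expected
    )
```

Run as a slow background scrub, a check like this turns undetected errors into detected ones, which only helps if another good copy exists to repair from.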

Things will always fail

• If you need reliability, you have to design it
in from the start.
• Not only will you have (a lot of) hardware
failures, all the software will break in
unexpected ways. Let's not talk about
networks...
• Maybe you don't need this...

Queueing

• We have work queues of different types of
media (e.g. mp3/wma/aac etc)
• In the database.
• Don’t do this.

MySQL sucks

• 1 type of JOIN
• No query rewriting
• Not enough stats for the planner to be
sane

This can hurt
• File Transform table:
• Master (File)
• Result (File)
• Status (pending/complete/failed/running)
• TransformStep (from/to)
• Leads to bad join order, massive fail

How to fail
• SELECT all file transforms that lead to wma
(millions).
• JOIN all files, ever (millions). Filter to find
those in state ‘pending’
• All pending looks like a bad bet - cardinality
of ‘all wmas’ looks better than cardinality of
‘all pending’.
• JOIN in the wrong order, nested loop,
screwed..

Queueing
• Did I mention queues in the DB suck?
• Even if you’re not screwing it up.
• Get a Message Queue (or at least an async
job server)
• If your problem is simple - Gearman.
Harder or you need interop - RabbitMQ.

Mutable state

• Mutable state is the enemy
• Too many things rw.
• No idea how an object got to this state

Anemic domain model
Object-oriented programming (OOP) is a
programming paradigm that uses "objects" –
data structures consisting of data fields and
methods together with their
interactions – to design applications and
computer programs. Programming techniques
may include features such as data
abstraction, encapsulation, modularity,
polymorphism, and inheritance.

Anemic domain model
• Superset of too much mutable state
• Able to create invalid objects
• Able to make previously valid objects
invalid
• Violation of the encapsulation and
information hiding principles.
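As a contrast to the anemic style, here is a sketch of an object that refuses to be created or driven into an invalid state. The `FileTransform` name and the four statuses echo the table described earlier, but the allowed transitions (including retrying a failed transform) and the history list are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


# Legal moves between states; anything else is rejected.
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETE, Status.FAILED},
    Status.COMPLETE: set(),
    Status.FAILED: {Status.PENDING},  # assumption: failed work may be retried
}


class FileTransform:
    """A transform that cannot be built, or pushed, into an invalid state."""

    def __init__(self, master, target_format):
        if not master:
            raise ValueError("a transform needs a master file")
        if not target_format:
            raise ValueError("a transform needs a target format")
        self._master = master
        self._target_format = target_format
        self._status = Status.PENDING
        self._history = [Status.PENDING]

    @property
    def status(self):
        return self._status  # read-only: no setter is defined

    @property
    def history(self):
        return tuple(self._history)  # how the object got to its current state

    def _move_to(self, new_status):
        if new_status not in ALLOWED[self._status]:
            raise ValueError(
                f"cannot go from {self._status.value} to {new_status.value}"
            )
        self._status = new_status
        self._history.append(new_status)

    def start(self):
        self._move_to(Status.RUNNING)

    def complete(self):
        self._move_to(Status.COMPLETE)

    def fail(self):
        self._move_to(Status.FAILED)

    def retry(self):
        self._move_to(Status.PENDING)
```

Because every change goes through `_move_to`, scripts can no longer set a status directly, and the history answers the "how did it get here" question.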

scripts

• Lots of our business logic was in scripts
that manipulated objects
• You need people to run scripts (in screen
sessions)
• Ewwww, ewwwww.

Jobs
• Moved to a job based approach
• Jobs started by file creation, or changing
state of something in a web app
• Jobs sent via message queueing.
• Results go via message queueing
• Jobs trigger other jobs

Jobs Example
• Validate XLS file supplied with order.
• Valid files trigger another job to create
objects for each thing in the XLS
• This then triggers another job to create
transforms, which are then done...
• ... etc ...
• Can’t do this workflow in a web request.
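The chain above can be sketched with an in-process queue standing in for the message queue. Everything here is illustrative: the job names, the payload shapes and the two target formats are assumptions, and a real deployment would publish to a broker rather than a `deque`.

```python
from collections import deque


def validate_order(payload):
    """Check the spreadsheet rows; a valid file triggers object creation."""
    rows = payload["rows"]
    if not rows or any(not row.get("title") for row in rows):
        return []  # invalid spreadsheet: the chain stops here
    return [("create_objects", {"rows": rows})]


def create_objects(payload):
    """One follow-up job per thing found in the spreadsheet."""
    return [("create_transforms", {"title": row["title"]}) for row in payload["rows"]]


def create_transforms(payload):
    """One encode job per target format (formats assumed for illustration)."""
    return [
        ("encode", {"title": payload["title"], "format": fmt})
        for fmt in ("mp3", "wma")
    ]


def encode(payload):
    """Leaf job: would call the encoding service; triggers nothing further."""
    return []


HANDLERS = {
    "validate_order": validate_order,
    "create_objects": create_objects,
    "create_transforms": create_transforms,
    "encode": encode,
}


def run(first_job):
    """Process jobs until the queue drains; return the names in run order."""
    queue = deque([first_job])
    log = []
    while queue:
        name, payload = queue.popleft()
        log.append(name)
        # Each handler returns the jobs it triggers, which are queued in turn.
        queue.extend(HANDLERS[name](payload))
    return log
```

Each handler does one small step and returns quickly, which is what lets the whole workflow run outside any single web request.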

Jobs Future

• More automation of things people run
scripts for.
• Automatic job regeneration (you will lose
messages).
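One possible shape for regeneration, sketched under assumptions: each piece of work records when its job was sent and whether a result came back, and a periodic sweep re-sends anything overdue. The record fields and the timeout are hypothetical, and re-sending is only safe if jobs are idempotent.

```python
def jobs_to_regenerate(records, now, timeout_seconds):
    """Return ids of records whose job was sent but never produced a result.

    records: iterable of dicts with 'id', 'sent_at' (seconds) and 'result'.
    """
    stale = []
    for record in records:
        waiting = record["result"] is None
        overdue = now - record["sent_at"] > timeout_seconds
        if waiting and overdue:
            stale.append(record["id"])
    return stale
```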

Lava flow

• Old (possibly unclean/invalid) data
• Old (unused/unmaintained) code
• “What harm does it do”

Relational integrity

• Seems to be a pipe dream more often than
not in the real world.
• Why?
• It’s not hard

Data consistency

• This should theoretically be the same thing
as relational integrity.
• In practice...

Mumble View Crap

• Too much logic in templates
• Copy & paste
• Business objects viewed as unchangeable
• Deleted 3000 lines from 2 simple
workflows. This fixed a dozen bugs.

Tangram
• No LEFT JOIN
• Displaying a product list becomes an x n
problem.
• OUCH
• Keep stupid - put the entire DB hot in
memcache!

Don’t do web design

• You are a programmer
• Make people pay for a design/CSS/HTML
person
• Work with them
• Be happy

Love your sysadmins
• Help them out.
• Build packages, or local::libs or something
• Keep everything in revision control
• Allow things to be sensibly configured.
• DOCUMENT THE POSSIBLE SETTINGS
• Use systems management - Puppet?

Love your logs

• Active feedback
• Aggregate in splunk
• Actively prune useless stuff
• Actively add useful stuff after a production
incident

ESI

• Is really awesome
• Make the pain go away
• PURGE requests
• Keep everything hot all the time

memcache everything

• Keep the entire database hot in memcache
• We mostly ask trivial questions, so just
cache those paths.
• 30 GB of RAM isn't actually much (3
boxes...)

memcache
• IS A CACHE
• Use sequential port numbers and CNAMEs
• E.g. cache0:11210, cache1:11211,
cache2:11212 etc.
• Run several per machine
• Allows you to scale capacity and rebalance
without an entire cache flush.
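One reading of the naming scheme above: clients hash keys onto stable logical names, and DNS decides which box each name currently lives on. In this sketch the CNAME table is a plain dict standing in for DNS, and the box names and the hash choice are assumptions.

```python
import hashlib

# Logical instances, as the client configuration would list them.
NODES = ["cache0:11210", "cache1:11211", "cache2:11212", "cache3:11213"]


def node_for(key, nodes=NODES):
    """Map a key to a logical node using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def physical_address(node, cnames):
    """Resolve 'cacheN:port' to (host, port) via a CNAME-like lookup table."""
    name, port = node.split(":")
    return cnames[name], int(port)
```

Moving `cache1` to a new box changes only its CNAME entry. Keys still hash to the same logical node, so only that one instance starts cold, not the whole cache.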

Don’t push bytes

• X-Sendfile and X-Accel-Redirect
• I already talked about file delivery like this
• Using 100 MB of RAM to proxy web
requests does not scale.

Test everything

• Redundant systems need testing
• You’ll still die unexpectedly in production
• If you can manage it, make responsibility for
deployment SEP (somebody else's problem).

• Thanks for listening
• Questions?
