2. # who’s talking
leonidas tsementzis
aka @goldstein
* software architect, engineer
[all major web/mobile platforms]
* devOps
[enthusiast, not a real sysadmin]
* entrepreneur
[n00b]
3. # the high-level requirements
* 2007
* take sport.gr to the next level...
* ... make sure it works smoothly...
* ... and fast enough
4. # i can see clearly now :)
* videos
[goals, match coverage]
* comments
[the blogging age, remember?]
* live streaming
[ustream does not exist, yet]
* live coverage of events
[cover it live does not exist, yet]
* user-centric design
[personalization, ratings]
* even more videos
[I can haz more LOLCats]
5. # the problem :(
* we are planning for 150% traffic growth, but...
[planning 6 months ahead]
* video costs
[bandwidth cost: 1€/GB]
* comments costs
[DB writes, CPU, disk i/o]
* live streaming costs
[bandwidth cost: 1€/GB]
* limited iron resources, not happy with our current host
[dedicated managed servers in top GR Datacenter]
6. # S3 to the rescue
* 87% cost reduction
[0.13€/GB vs 1€/GB]
* made videos section possible...
* ...and advertisers loved it ($$$+)
* first GR site to focus on video, key competitive advantage
* 6TB video traffic in the first month
* hired a video editing team to support the demand
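The 87% figure above is just the bandwidth-price ratio applied to that first month of video traffic. A back-of-the-envelope sketch (the per-GB prices and the 6TB figure come from the slides; everything else is illustrative):

```python
# Bandwidth cost comparison: old host at 1 EUR/GB vs S3 at 0.13 EUR/GB,
# applied to the ~6 TB of video traffic from the first month.

def monthly_cost(traffic_gb, eur_per_gb):
    """Bandwidth cost for a month of traffic."""
    return traffic_gb * eur_per_gb

traffic_gb = 6 * 1024                      # ~6 TB expressed in GB
old = monthly_cost(traffic_gb, 1.00)       # dedicated host pricing
new = monthly_cost(traffic_gb, 0.13)       # S3 pricing
reduction = (old - new) / old              # -> 0.87, the 87% on the slide

print(f"old: {old:.0f} EUR, S3: {new:.0f} EUR, saved: {reduction:.0%}")
```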
7. # EC2 servers on demand
* 3x(n) Application servers for the main website
[Windows 2003, IIS 6]
* 2x(n) Application servers for APIs
[Windows 2003, IIS 6]
* 2x(n) Servers for banner managers
[CentOS, Apache, OpenX]
* 1x Storage server
* 2x Database servers
[MS SQL Server 2008 with failover]
* 2x Reverse Proxy cache servers
[Squid]
* 2x Load Balancers
[HAProxy with failover]
* 1x monitoring server
[munin with a lot of custom plugins]
9. # a typical week
* peaks at 3k hits/sec once or twice/week
* normal rate at 300 hits/sec
* provision for the 1st and you can’t afford it
* provision for the 2nd and you can’t deliver
10. # auto-scaling to the rescue
* if average CPU usage stays above 60% for 2 minutes, add
another application server
* if average CPU usage stays below 30% for 5 minutes, gracefully
terminate an application server
* 20 instances on peaks
* 3 instances (minimum) on normal operations
* no more “Server is busy” errors
* pay only for what you (really) need
* you can now sleep at night
* 60% overall cost reduction
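The two rules above can be sketched as a tiny policy function. This is not the actual AWS Auto Scaling configuration, just the decision logic with the thresholds and the 3-to-20 instance bounds from the slides:

```python
# Scale up when average CPU stays over 60% for 2 consecutive minutes,
# scale down after 5 minutes under 30%, bounded by the 3-instance
# floor and the ~20-instance peak mentioned in the slides.

MIN_INSTANCES, MAX_INSTANCES = 3, 20
UP_THRESHOLD, UP_MINUTES = 60, 2
DOWN_THRESHOLD, DOWN_MINUTES = 30, 5

def scale(instances, cpu_samples):
    """cpu_samples: one average-CPU reading per minute, oldest first."""
    if len(cpu_samples) >= UP_MINUTES and all(
            c > UP_THRESHOLD for c in cpu_samples[-UP_MINUTES:]):
        return min(instances + 1, MAX_INSTANCES)
    if len(cpu_samples) >= DOWN_MINUTES and all(
            c < DOWN_THRESHOLD for c in cpu_samples[-DOWN_MINUTES:]):
        return max(instances - 1, MIN_INSTANCES)  # graceful terminate
    return instances
```

e.g. two minutes at 70%+ CPU adds a server, five quiet minutes removes one, and the fleet never drops below the 3-instance minimum.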
11. # wait, there’s more!
* CDN & media streaming with CloudFront
* use multiple CNAMEs with CloudFront to parallelize HTTP
requests
[as YSlow recommends]
* CloudFront custom domains are sexy
* robust DNS with Route 53
* simple monitoring with CloudWatch
[you still need an external monitoring tool]
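The multiple-CNAMEs trick works because browsers of that era opened only a few connections per hostname, so spreading assets across several CloudFront CNAMEs lets more requests run in parallel. A minimal sketch (the hostnames are made up; hashing keeps each asset pinned to one hostname so it stays cacheable):

```python
import zlib

# Hypothetical CloudFront CNAMEs, e.g. static1..static4.example.com
CDN_HOSTS = [f"static{i}.example.com" for i in range(1, 5)]

def asset_url(path):
    """Map an asset path to a stable CDN hostname (domain sharding)."""
    shard = zlib.crc32(path.encode()) % len(CDN_HOSTS)
    return f"http://{CDN_HOSTS[shard]}{path}"
```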
13. # lessons learned
* test, iterate, test, iterate
* reserved instances save you $$
* EC2 is a hacker playground
[prepare for DoS attacks]
* backup entire AMIs to S3
[instances *WILL* #FAIL]
* EBS disk I/O is slow, but Amazon is working on it
[problems with DB writes]
* spawning new instances is slow
[15 mins provisioning can be a show stopper on scaling]
* S3 uploads/downloads are slow
* sticky sessions are a must
[we replaced AWS ELB with HAProxy just for this]
* SLAs can't guarantee high availability
[AWS *WILL* #FAIL]
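The HAProxy-for-stickiness swap mentioned above boils down to cookie-based persistence. A minimal config sketch (backend name and server addresses are made up):

```
backend app_servers
    balance roundrobin
    # insert a cookie so each client sticks to one app server
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.11:80 cookie app1 check
    server app2 10.0.0.12:80 cookie app2 check
```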
14. # more lessons learned
* devOps people are hard to find
[interested? I’m hiring]
* automate everything
[makes you sleep at night]
* monitor everything
[munin is your friend]
* disaster prevention
[*ALWAYS* plan around the worst-case scenario]
* windows server administration is a mess
[and AWS is not making this prettier]
* DB scaling is the hardest part
[requires code changes]
* legacy software *IS* a problem
** on scaling
** on hiring
** on growing (have you tried to use XMPP via ASP?)
15. # AWS is not perfect
* Akamai is still faster than CloudFront
[especially in Greece]
* not affordable for large architectures
[if you’re running 300+ instances, consider building your own
datacenter]