2. # who’s talking
leonidas tsementzis
aka @goldstein
* software architect, engineer
[all major web/mobile platforms]
* devOps
[enthusiast, not a real sysadmin]
* entrepreneur
[n00b]
3. # the high-level requirements
* 2007
* take sport.gr to the next level...
* ... make sure it works smoothly...
* ... and fast enough
4. # i can see clearly now :)
* videos
[goals, match coverage]
* comments
[the blogging age, remember?]
* live streaming
[ustream does not exist, yet]
* live coverage of events
[cover it live does not exist, yet]
* user-centric design
[personalization, ratings]
* even more videos
[I can haz more LOLCats]
5. # the problem :(
* we are planning for 150% traffic growth, but...
[planning 6 months ahead]
* video costs
[bandwidth cost: 1€/GB]
* comments costs
[DB writes, CPU, disk i/o]
* live streaming costs
[bandwidth cost: 1€/GB]
* limited iron resources, not happy with our current host
[dedicated managed servers in top GR Datacenter]
6. # S3 to the rescue
* 87% cost reduction
[0.13€/GB vs 1€/GB]
* made videos section possible...
* ...and advertisers loved it ($$$+)
* first GR site to focus on video, key competitive advantage
* 6TB video traffic in the first month
* hired a video editing team to support the demand
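The 87% figure above is just the bandwidth-price ratio applied to that first month of video traffic. A back-of-the-envelope sketch (the per-GB prices and the 6TB figure come from the slides; everything else is illustrative):

```python
# Bandwidth cost comparison: old host at 1 EUR/GB vs S3 at 0.13 EUR/GB,
# applied to the ~6 TB of video traffic from the first month.

def monthly_cost(traffic_gb, eur_per_gb):
    """Bandwidth cost for a month of traffic."""
    return traffic_gb * eur_per_gb

traffic_gb = 6 * 1024                      # ~6 TB expressed in GB
old = monthly_cost(traffic_gb, 1.00)       # dedicated host pricing
new = monthly_cost(traffic_gb, 0.13)       # S3 pricing
reduction = (old - new) / old              # -> 0.87, the 87% on the slide

print(f"old: {old:.0f} EUR, S3: {new:.0f} EUR, saved: {reduction:.0%}")
```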
7. # EC2 servers on demand
* 3x(n) Application servers for the main website
[Windows 2003, IIS 6]
* 2x(n) Application servers for APIs
[Windows 2003, IIS 6]
* 2x(n) Servers for banner managers
[CentOS, Apache, OpenX]
* 1x Storage server
* 2x Database servers
[MS SQL Server 2008 with failover]
* 2x Reverse Proxy cache servers
[Squid]
* 2x Load Balancers
[HAProxy with failover]
* 1x monitoring server
[munin with a lot of custom plugins]
9. # a typical week
* peaks at 3k hits/sec once or twice/week
* normal rate at 300 hits/sec
* provision for the 1st and you can’t afford it
* provision for the 2nd and you can’t deliver
10. # auto-scaling to the rescue
* if average CPU usage stays above 60% for 2 minutes, add
another application server
* if average CPU usage stays below 30% for 5 minutes, gracefully
terminate an application server
* 20 instances on peaks
* 3 instances (minimum) on normal operations
* no more “Server is busy” errors
* pay only for what you (really) need
* you can now sleep at night
* 60% overall cost reduction
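The two rules above can be sketched as a tiny policy function. This is not the actual AWS Auto Scaling configuration, just the decision logic with the thresholds and the 3-to-20 instance bounds from the slides:

```python
# Scale up when average CPU stays over 60% for 2 consecutive minutes,
# scale down after 5 minutes under 30%, bounded by the 3-instance
# floor and the ~20-instance peak mentioned in the slides.

MIN_INSTANCES, MAX_INSTANCES = 3, 20
UP_THRESHOLD, UP_MINUTES = 60, 2
DOWN_THRESHOLD, DOWN_MINUTES = 30, 5

def scale(instances, cpu_samples):
    """cpu_samples: one average-CPU reading per minute, oldest first."""
    if len(cpu_samples) >= UP_MINUTES and all(
            c > UP_THRESHOLD for c in cpu_samples[-UP_MINUTES:]):
        return min(instances + 1, MAX_INSTANCES)
    if len(cpu_samples) >= DOWN_MINUTES and all(
            c < DOWN_THRESHOLD for c in cpu_samples[-DOWN_MINUTES:]):
        return max(instances - 1, MIN_INSTANCES)  # graceful terminate
    return instances
```

e.g. two minutes at 70%+ CPU adds a server, five quiet minutes removes one, and the fleet never drops below the 3-instance minimum.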
11. # wait, there’s more!
* CDN & media streaming with CloudFront
* use multiple CNAMEs with CloudFront to parallelize HTTP
requests
[as YSlow recommends]
* CloudFront custom domains are sexy
* robust DNS with Route 53
* simple monitoring with CloudWatch
[you still need an external monitoring tool]
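The multiple-CNAMEs trick works because browsers of that era opened only a few connections per hostname, so spreading assets across several CloudFront CNAMEs lets more requests run in parallel. A minimal sketch (the hostnames are made up; hashing keeps each asset pinned to one hostname so it stays cacheable):

```python
import zlib

# Hypothetical CloudFront CNAMEs, e.g. static1..static4.example.com
CDN_HOSTS = [f"static{i}.example.com" for i in range(1, 5)]

def asset_url(path):
    """Map an asset path to a stable CDN hostname (domain sharding)."""
    shard = zlib.crc32(path.encode()) % len(CDN_HOSTS)
    return f"http://{CDN_HOSTS[shard]}{path}"
```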
13. # lessons learned
* test, iterate, test, iterate
* reserved instances save you $$
* EC2 is a hacker playground
[prepare for DoS attacks]
* backup entire AMIs to S3
[instances *WILL* #FAIL]
* EBS disk I/O is slow, but Amazon is working on it
[problems with DB writes]
* spawning new instances is slow
[15 mins provisioning can be a show stopper on scaling]
* S3 uploads/downloads are slow
* sticky sessions are a must
[we replaced AWS ELB with HAProxy just for this]
* SLAs can't guarantee high availability
[AWS *WILL* #FAIL]
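The HAProxy-for-stickiness swap mentioned above boils down to cookie-based persistence. A minimal config sketch (backend name and server addresses are made up):

```
backend app_servers
    balance roundrobin
    # insert a cookie so each client sticks to one app server
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.11:80 cookie app1 check
    server app2 10.0.0.12:80 cookie app2 check
```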
14. # more lessons learned
* devOps people are hard to find
[interested? I’m hiring]
* automate everything
[makes you sleep at night]
* monitor everything
[munin is your friend]
* disaster prevention
[*ALWAYS* plan around the worst-case scenario]
* windows server administration is a mess
[and AWS is not making this prettier]
* DB scaling is the hardest part
[requires code changes]
* legacy software *IS* a problem
** on scaling
** on hiring
** on growing (have you tried to use XMPP via ASP?)
15. # AWS is not perfect
* Akamai is still faster than CloudFront
[especially in Greece]
* not affordable for large architectures
[if you’re running 300+ instances, consider building your own
datacenter]