Building a scalable online backup system in Python
1. Building a Scalable Online Backup System in Python
   Joe Drumgoole
   http://twitter.com/jdrumgoole
2. Scaling
   - You probably shouldn’t care
   - Throughput vs response time
   - Scaling is a fractal problem
   - The database is what will get ya!
   - Amazing what a well-tuned DB will support
4. Online Backup
   - Not really a Web 2.0 play
   - More like client server
   - Larger vision of PutPlace
   - Map of your Digital World
5. Online Backup: Client
   - Installation and support of Windows 20**
   - Mac support
   - Open file/locked file handling
   - Bandwidth throttling
   - CPU throttling
   - Upload restarts
   - Feedback
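Bandwidth throttling on the client can be sketched as a budget check after each send. This is a minimal illustration, not the PutPlace client: the `Throttle` class and its parameters are hypothetical, and the clock and sleep functions are injectable so the example runs instantly with a fake clock.

```python
import time


class Throttle:
    """Caps average throughput by sleeping when sends run ahead of budget."""

    def __init__(self, bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(bytes_per_second)
        self.clock = clock
        self.sleep = sleep
        self.start = None   # time of the first accounted send
        self.sent = 0       # total bytes accounted for so far

    def wait_after(self, nbytes):
        """Call after sending nbytes. Sleeps if needed and returns the delay."""
        now = self.clock()
        if self.start is None:
            self.start = now
        self.sent += nbytes
        # Earliest moment at which `sent` bytes fit inside the rate budget.
        earliest = self.start + self.sent / self.rate
        delay = max(0.0, earliest - now)
        if delay:
            self.sleep(delay)
        return delay


# Demo with a fake clock (an assumption for illustration only).
fake_now = [100.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = Throttle(1000, clock=lambda: fake_now[0], sleep=fake_sleep)
first_delay = throttle.wait_after(500)    # ahead of budget, so it waits
second_delay = throttle.wait_after(500)   # still ahead, waits again
fake_now[0] += 10.0                       # the client goes idle for a while
third_delay = throttle.wait_after(100)    # well behind budget, no wait
```

A real client would call `wait_after` with the default clock after each network write; CPU throttling would need a different mechanism.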
6. Online Backup: Server
   - Don’t lose any files
   - De-duplication
   - Thumbnail generation for images
   - Flickr backup
   - Client feedback
   - Bulk download
   - File relationships
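The slides do not say how de-duplication was implemented. One common approach, shown here purely as an assumption, is to key stored content by a checksum so identical bytes are kept once however many users upload them. `DedupStore` and its methods are hypothetical names.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}   # digest -> file content
        self.names = {}   # (user, path) -> digest

    def put(self, user, path, data):
        """Record a file for a user. Returns True if new bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.names[(user, path)] = digest
        return is_new

    def get(self, user, path):
        """Return the content recorded for this user's path."""
        return self.blobs[self.names[(user, path)]]


store = DedupStore()
first = store.put("alice", "holiday.jpg", b"same bytes")
second = store.put("bob", "copy.jpg", b"same bytes")  # duplicate content
```

Here the second upload adds a name entry but no new blob, which is the storage saving de-duplication is after.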
7. Online Backup - Secrets
   - People don’t back up
   - Compute dominates
   - Restores represent 0.01% of bandwidth and load
   - Writing web clones of Windows Explorer is hard
   - The browser sucks as a client-side app container (for now)
8. Scaling
   - For online backup the challenge is to receive shedloads of data from lots of clients
   - Clients upload in 1MB chunks
   - Chunks must be stored, coalesced and pushed to stable backup (S3)
   - Clients must get acknowledgement
   - Web page must update
   - Quota management
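The chunk flow above (upload in 1MB pieces, acknowledge each one, coalesce when complete) can be sketched in memory as follows. This is an illustration under assumptions: `UploadSession`, `split_into_chunks` and the acknowledgement format are hypothetical, and real chunks would be written to disk or S3 rather than held in a dict.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB, the chunk size mentioned in the talk


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Client side: cut a file's bytes into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class UploadSession:
    """Server side: collects the chunks of one file, in any order."""

    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.chunks = {}  # chunk index -> bytes

    def receive(self, index, chunk):
        """Store one chunk and return an acknowledgement for the client."""
        if not 0 <= index < self.total_chunks:
            raise ValueError("chunk index out of range")
        self.chunks[index] = chunk  # re-sending a chunk is harmless
        return {"ack": index, "complete": self.is_complete()}

    def is_complete(self):
        return len(self.chunks) == self.total_chunks

    def missing(self):
        """Indexes a restarted upload still has to send."""
        return [i for i in range(self.total_chunks) if i not in self.chunks]

    def coalesce(self):
        """Join the chunks in index order into the original file."""
        if not self.is_complete():
            raise RuntimeError("missing chunks: %s" % self.missing())
        return b"".join(self.chunks[i] for i in range(self.total_chunks))


payload = bytes(range(256)) * 10000          # 2,560,000 bytes
pieces = split_into_chunks(payload)
session = UploadSession(len(pieces))
acks = [session.receive(i, pieces[i]) for i in (2, 0, 1)]  # out of order
assembled = session.coalesce()
```

The `missing()` list is what would let a client resume after an interrupted upload instead of starting over.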
9. Load Balancer
   - Load balancer: Perlbal http://www.danga.com/perlbal/
   - Can handle 100s of millions of requests per day
   - Event based (sshhh: don’t tell anyone, but it’s Perl!)
   - It does fall over occasionally
   - Otherwise works perfectly
10. App Server
   - Our app servers:
     - Handle login
     - Deliver web pages
     - Handle uploads from clients
   - Hand off heavy-duty processing to task servers
     - Thumbnail generation
     - File coalescing
     - Checksum generation
   - Hand-off is via a database queue
11. App Server
   - Just Django instances
   - Templates deliver web pages
   - Views handle chunks/login etc.
   - Models update the database
   - Task servers do the heavy lifting
12. Task Server
   - Run off a database queue (table)
   - Four main task servers:
     - Assemble completed file uploads
     - Create thumbnails
     - Remove deleted files
     - Generate user statistics
   - Servers are multi-threaded
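A queue held in a database table, drained by multi-threaded workers, might look roughly like the sketch below. It is not the PutPlace code: the table layout and the `TaskQueue` and `worker` names are hypothetical, and an in-memory SQLite database stands in for Postgres so the example is self-contained.

```python
import sqlite3
import threading

SCHEMA = """
CREATE TABLE task (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    kind    TEXT NOT NULL,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""


class TaskQueue:
    def __init__(self):
        # One shared in-memory connection; the lock serialises access to it.
        self.conn = sqlite3.connect(":memory:", check_same_thread=False)
        self.conn.execute(SCHEMA)
        self.lock = threading.Lock()

    def enqueue(self, kind, payload):
        with self.lock:
            self.conn.execute(
                "INSERT INTO task (kind, payload) VALUES (?, ?)", (kind, payload))
            self.conn.commit()

    def claim(self, kind):
        """Mark the oldest pending task of this kind as running; return it."""
        with self.lock:
            row = self.conn.execute(
                "SELECT id, payload FROM task "
                "WHERE kind = ? AND status = 'pending' ORDER BY id LIMIT 1",
                (kind,)).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE task SET status = 'running' WHERE id = ?", (row[0],))
            self.conn.commit()
            return row

    def finish(self, task_id):
        with self.lock:
            self.conn.execute(
                "UPDATE task SET status = 'done' WHERE id = ?", (task_id,))
            self.conn.commit()

    def count(self, status):
        with self.lock:
            return self.conn.execute(
                "SELECT COUNT(*) FROM task WHERE status = ?",
                (status,)).fetchone()[0]


def worker(queue, kind, handler):
    """Drain every pending task of one kind, then exit."""
    while True:
        task = queue.claim(kind)
        if task is None:
            return
        task_id, payload = task
        handler(payload)
        queue.finish(task_id)


queue = TaskQueue()
for name in ("a.jpg", "b.jpg", "c.jpg"):
    queue.enqueue("thumbnail", name)

processed = []
threads = [threading.Thread(target=worker, args=(queue, "thumbnail", processed.append))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The Python lock only protects threads in one process. With several task server processes, the claim step would need database-level locking instead.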
13. Refactoring
   - Originally N Blacknight servers writing to NFS
   - Then N Blacknight servers writing to S3
   - Then N EC2 servers writing to S3
   - Then N EC2 servers writing to MogileFS/S3
   - Lots of uploading optimisations along the way
14. Results
   - System has successfully uploaded over 100k files in a single day
   - Regularly does 50k files a day
   - Have about 2k registered users
   - Continues to get registrations
   - Runs in lights-out mode (no daily/weekly/monthly housekeeping)
15. What Worked
   - Python proved extremely flexible
   - Standard library saved us lots of work
   - Django provided a lot of glue
   - Easy to migrate from dedicated host on NFS to cloud hosting and S3 storage
   - Nagios/Monitis monitoring
16. What Didn’t Work
   - Would use MySQL rather than Postgres
     - Easier to cluster, more knowledge available
   - Native Windows client
     - Unnecessary, Python client was good enough
   - Would use an off-the-shelf queueing system
     - RabbitMQ, ActiveMQ, SQS
   - Kludgey client-side API
   - Threading the client
17. Tool Chain
   - Wush.net: Subversion and Trac
   - DynDNS: Dynamic DNS
   - Python/Django: Dev stack
   - Postgres: Database
   - Hudson: Build server
   - Perlbal: Load balancing
   - MogileFS: Distributed file system
   - Memcached: Caching
   - Nagios, Monitis: Monitoring
   - Hamachi: VPN through firewall
   - Google Apps: Email, calendar, docs, wiki
   - AuthSMTP: Validated SMTP
   - Zendesk: Support desk
   - Amazon: Storage, compute, bandwidth
   - Paypal: Billing
18. Costs
   - Capital expenditure
     - One server: 5k euro
     - One laptop per developer: 2.5k (7 devs)
     - One Linksys WIFI/firewall (won at raffle)
     - Two 24-port switches: 1.6k
     - Total: ~24k
   - Running costs for grid and storage
     - ~1800 euro a month (8 instances)
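The capital total can be checked with a few lines of arithmetic. Two assumptions are made here: the 1.6k figure covers both switches together (which is what makes the sum land near the stated ~24k), and the raffle-won Linksys counts as zero. The per-instance figure is only a blended average, since the monthly cost also includes storage.

```python
# All amounts in euro, taken from the slide.
server = 5000
laptops = 7 * 2500        # one 2.5k laptop for each of the 7 developers
switches = 1600           # assumed to be the price of both switches together
wifi_firewall = 0         # won at a raffle

capital_total = server + laptops + switches + wifi_firewall

monthly_running = 1800    # grid and storage
instances = 8
monthly_per_instance = monthly_running / instances  # blended, includes storage

print("Capital expenditure: %d euro" % capital_total)
print("Running cost per instance: %.0f euro a month" % monthly_per_instance)
```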
19. If I Were Doing It Again
   - Stick with native Python client
   - Look at eventing à la Node.js for server
   - Use MySQL
   - Use Google App Engine as front end/load balancer
   - Use a commercial queueing package