The challenges of live events scalability


Published in: Technology

  2. Hello
     • I’m Guy Tomer
         • Founding and working in start-ups for the last 13 years
         • Founder & CTO of attracTV for the last 4 years
     • This presentation is about
         • Building a scalable system for “a lot” of users
         • More specifically, handling usage peaks of live TV events on the internet
         • Even more specifically – how we tackle it as a small start-up
  3. attracTV
     Web-based self-service solution and tools for managing viewers’ engagement and interaction on the online screen
     Social Information Advertisement eCommerce
  4. Our Use Case – MTV European Music Awards
     • One of the biggest online live streams ever
     • Can’t expose precise numbers but
         • 7 digits (> 1,000,000) – number of streams
         • 6 digits (> 100,000) – number of concurrent users
         • 5 digits (> 10,000) – number of users joining every minute at peak
     • In addition
         • International event, 20 sites, viewers from >150 countries
         • 9 languages
  5. What Are The Challenges
     1. Scaling for these numbers
     2. Handling very steep ramp-up
     3. Big data
     4. High availability
     5. Testing & preparing for such numbers
     6. The cost of the above – how to do it and still make money
     We’ll discuss mainly 1, 5 & 6
  6. Some Big “Internet Scale” Examples
     • Google uses about 900,000 servers
     • (Map-Reduce) Google sorted a ten-petabyte input set in 6 hours and 27 minutes on 8,000 computers
     • Facebook serves 1 trillion pages per month
     • (2010) 30 billion – pieces of content (links, notes, photos, etc.) shared on Facebook per month
     • (2010) 2 billion – the number of videos watched per day on YouTube
     • Akamai, the “CDN to the stars”, has 95,811 servers (Q2 2011), 1,000 networks, 70 countries
  7. Challenge 1 – Handling The Scale
     • We are prepared for 400,000 concurrent viewers
     • HTTP polling every N seconds, 10 <= N <= 30
     • This means ~20,000 HTTP R/S (requests per second)
     • For comparison
         • Stack Overflow recently reported 800 R/S
         • (leading portal in India) reported 3,900 R/S
         • Jobs’ death resulted in a record-breaking 10,000 tweets/s (they do have a lot more requests, that’s just to feel the scale)
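The ~20,000 R/S figure follows from dividing the number of concurrent viewers by the polling interval. A minimal sketch, assuming every viewer polls exactly once per interval and that polls are spread evenly over time; the function name is illustrative and not from the deck:

```python
def requests_per_second(concurrent_users: int, poll_interval_s: float) -> float:
    """Average request rate when each user polls once per interval.

    Assumption: polls are evenly spread, so there are no synchronized bursts.
    """
    return concurrent_users / poll_interval_s


viewers = 400_000  # capacity target from the slide
for interval in (10, 20, 30):  # bounds of N from the slide, plus the midpoint
    rate = requests_per_second(viewers, interval)
    print(f"poll every {interval}s -> {rate:,.0f} R/S")
```

The slide's ~20,000 R/S matches the midpoint N = 20; at N = 10 the same audience would generate twice that load.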
  8. What Is Scalability
     • From Wikipedia: “Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth.”
     Performance ≠ Scalability
     The fact that your code runs very fast for X users doesn’t mean your architecture supports 100*X users.
  9. Vertical Scalability (scale up)
     • “Get a bigger server”
     • “Use faster CPUs”
     • Cons
         • Can only help so much (with bad scale/$ value)
         • A server twice as fast is more than twice as expensive
     • Pros
         • Easier to manage fewer computers
         • Can use virtualization
  10. Horizontal Scalability (scale out)
     • “Just add another box” (or another thousand or ...)
     • Plan the architecture right first, do micro optimizations later
     • Pros
         • Unlimited theoretically
         • Works well with the cloud services elasticity
     • Cons
         • More complex to manage
         • More complex programming models
  11. Challenge #2 – Steep Ramp-up
     • Live event - everyone comes at the same time
     [Charts: steep ramp-up vs. standard website example (Wikimedia)]
     • That a car can drive 250 km/h doesn’t mean it can do 0-100 km/h in 4 seconds
  12. Challenge #3 – Big Data
     • From Wikipedia: “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools”
     • One of the biggest hypes in the industry today
     • During this event we had ~10,000,000 records written to our analytics system per hour
     • We’re not “Big Data” yet but it’s coming
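To put the hourly record count on the same footing as the request rates quoted elsewhere in the deck, it can be converted to an average write rate. A small sketch; it assumes writes are spread evenly over the hour, which a live event with peaks will not strictly satisfy, so the real peak rate is higher:

```python
SECONDS_PER_HOUR = 3600


def writes_per_second(records_per_hour: int) -> float:
    """Average write rate, assuming records arrive evenly over the hour."""
    return records_per_hour / SECONDS_PER_HOUR


rate = writes_per_second(10_000_000)  # ~10M records/hour from the slide
print(f"{rate:,.0f} writes/s on average")
```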
  13. Challenge #4 – High Availability
     “High availability refers to a system or component that is continuously operational for a desirably long length of time.”
     • We need to meet a Service Level of 99.9%
     • Backup and failover systems are expensive
     • The cloud helps us here
     [Figure: High availability in the cloud]
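A 99.9% service level translates into a concrete downtime budget. The sketch below does the conversion; the 30-day and 365-day periods are assumptions chosen for illustration, since the deck does not say over which period the service level is measured:

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of allowed downtime in the period at the given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_pct) / 100


# Assumed measurement periods (not stated in the deck).
print(f"30-day month: {downtime_budget_minutes(99.9, 30):.1f} min")
print(f"365-day year: {downtime_budget_minutes(99.9, 365) / 60:.2f} h")
```

For an event that lasts only a few hours, even the whole monthly budget is a large share of the event, which is why failover matters here.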
  14. Challenge #5 – Testing
     • Simulating hundreds of thousands of concurrent users… not trivial
     • Requires tens of strong servers
     • Very difficult to collect the data
     • The cloud helps us here
  15. Challenge #6 – Handling The Costs Of Such An Event (Hint: Elasticity)
     • For production we used ~50 servers with 4 cores at 2 GHz and 15 GB RAM (m1.xl)
     • Some options (rough estimation) for this are:
         • Buy - ~$3,500 per box = $175,000. Not for us…
         • Dedicated server for a month - ~$1,000 per instance = $50,000
         • VPS (virtual private server) monthly - ~$300 per box = $15,000
     • Solution: Cloud on-demand (Amazon AWS) - ~$500 per instance = $25,000 for a month… BUT… no need to take it for a month: we activate it on demand for 12 hours and it costs $416!
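The $416 figure can be reproduced by prorating the monthly on-demand price to the hours actually used. A rough sketch using the slide's own estimates; the 30-day month and the helper names are assumptions for illustration, and real cloud pricing is quoted per hour rather than per month:

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month
SERVERS = 50               # ~50 production servers, from the slide


def fixed_cost(price_per_box: float, boxes: int = SERVERS) -> float:
    """Cost of options that are paid in full regardless of hours used."""
    return price_per_box * boxes


def on_demand_cost(monthly_price: float, hours: float, boxes: int = SERVERS) -> float:
    """Monthly price prorated to the hours the instances actually run."""
    return monthly_price / HOURS_PER_MONTH * hours * boxes


print(f"Buy:               ${fixed_cost(3500):>10,.0f}")
print(f"Dedicated (month): ${fixed_cost(1000):>10,.0f}")
print(f"VPS (month):       ${fixed_cost(300):>10,.0f}")
print(f"Cloud, full month: ${on_demand_cost(500, HOURS_PER_MONTH):>10,.0f}")
print(f"Cloud, 12 hours:   ${on_demand_cost(500, 12):>10,.2f}")
```

The 12-hour run comes to about $416.67, which the slide rounds down to $416.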
  16. Our #1 Lesson - Think Horizontal!
     • Why not vertical?
         • We don’t want it to be our business’s bottleneck at any point in time
         • We don’t want to buy giant servers
         • We wanted a cheap start
         • We want elasticity
         • We don’t want to buy anything at this point
     • How? (deserves a separate lecture)
         • Everything in the architecture
         • No state shared between the web/app servers
         • No relation between the # of users and the load on the database
  17. Lesson #2 – KISS
     • Keep It Simple Stupid
     • Your system architecture
     • Your code
     • Your features
     • Your business model
     • If you don’t, you won’t scale (from personal experience)
     Hug out all the complexity in your system
  18. Lesson #3 – Load Test Everything, Focus On Real World Usage Patterns
     • We did massive stress testing
     • We launched tens of servers just for stress testing
     • Automated with JMeter and monitored the same way as production
     Why?
     • It is the only way to test your scaling capabilities
     • Looking at the code and running manual tests are irrelevant
     • Measure the capacity of a single app server
     • Test the specific ramp-up scenario because
     • Example: 1 app server = 5,000 users; we need to support 200,000 users, so we need to prepare at least 40 servers
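The sizing example on this slide is a ceiling division of the target audience by the measured per-server capacity. A minimal sketch; the function name is illustrative, and it assumes load spreads evenly across stateless app servers with no extra headroom:

```python
def servers_needed(target_users: int, users_per_server: int) -> int:
    """Smallest number of servers whose combined capacity covers the target.

    Assumption: load is balanced evenly and capacity scales linearly.
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    # Ceiling division in pure integer arithmetic.
    return -(-target_users // users_per_server)


print(servers_needed(200_000, 5_000))  # the slide's example
print(servers_needed(400_000, 5_000))  # the 400,000-viewer target from Challenge 1
```

The slide says "at least" 40 because the result is a floor on the fleet size; any safety margin for failures or uneven load comes on top of it.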
  19. Lesson #4 – S*t Happens, Don’t Save On Real-Time Monitoring and Support
     • We had a series of successful big events before this one
     • We launched tens of servers just for the stress testing
     • And yet we had two problems during the event
     Why?
     • Murphy is always (eventually) right…
     • Because of a feature no one uses (see lesson #2 - KISS) that wasn’t active in the stress tests
     • The specific usage of 9 languages caused unexpected load (see lesson #3 – stress real world scenarios)
     Luckily the whole team was in monitoring mode and the issues were quickly handled on the fly.
  20. Lesson #5 – Use The Cloud (startups)
     • It’s elastic, pay on demand
     • Flexible when you don’t know your parameters
     • Solution for affordable high availability & testing
     • Focus on development
     • I am not getting paid by Amazon – check others as well!
  21. Summary - What To Remember?
     • Scalability is the ability of a system to handle a growing amount of work with additional resources
     • Think horizontal
     • Keep It Simple (Stupid) – everything
     • Stress test everything, focus on real world scenarios
     • Monitor and provide real-time support
     • Cloud is great for start-ups
  22. The End
     • Questions? Comments? Consulting Preguntas? 问题?
     • Just shy? Think you should be working in attracTV? Contact
  23. Special Thanks (presentations, websites I “borrowed” from)
     • Ask Bjørn Hansen
     • High Scalability blog
     • Google images
     • Entourage