How to develop innovative, scalable systems

My talk at the wearedevelopers.org conference and at the ViennaJS meetup

Published in: Internet

  • What I want to talk about
  • That's me, 1980/81, with my first computer. Anyone know the computer? I studied arts, lived in New York & Berlin, have built startups and have crashed startups.
  • What is hitbox? This is the front page.
  • This is a streamer: he plays games and streams them. Most streamers are also entertainers, making money with advertising & subscriptions.

    6 M uniques/month, number 2 in the world
  • Sounds easy, right? Chat has existed for 30 years.
  • So, how hard can it be?
  • A lot of things to do! And that's just the beginning!
  • Most important is realtime: you write something and everyone else should see it as fast as possible.
  • For example, he dances (and lost 20 kg this way) and people cheer him on in the chat.
  • So back to the chat. IRC is a protocol that has been in use for 30 years; we wanted to make something new, something modern, something without netsplits, etc.
  • We started with this because our backend was already in PHP. Let's see if this works out!

  • Easy setup:
    And MySQL as the database.
    Well, these two sentences already tell you all of the problems...
  • Imagine a „long-running PHP process to serve multiple WebSocket connections“
  • It worked for up to 2000 connections, not very scalable!
  • So back to the drawing board. We wanted something modern, so let's use modern software!
  • We went with Node.js & Redis. Does anyone here have experience with Node.js as a server?
  • We use a two-tier setup:
    Frontend servers and backend servers, with Redis as data storage. If we lose the Redis data we just lose who is in which chat room; just press F5 and you are back in.
  • We use AWS
    Single-core machines
  • Same machines as for frontends
  • I can only recommend it. I have never seen a Redis instance fail (except for getting slow).
  • So, looks like a perfect system, let's code it!
    We did and...
  • it worked!
    So we could party!
  • Not so fast!!!!!!!
    There we had our first problem, with something everyone should support.
  • It's a fucking standard!
    But there are firewalls that block it, there are mobile devices that block it or, even worse, tell you that a WebSocket connection is working when it isn't. They just lie to you!
    0.5-1% of users have this problem, but they were mailing us like hell...
  • 0.5-1%
    So we had to use fallback servers for long polling. Long polling means a lot of overhead from the HTTP protocol, so these servers can handle only 1/10 of what a normal frontend server handles, but it works!

  • So we thought we could party again.
  • Well, the hitbox audience is young, so they try a lot... You can't imagine how often we get DDOSed or people try to abuse the API...

    And last year, someone managed it:
  • It was during what was at that time our biggest event ever: 60k people on one stream, and suddenly all of them saw this.
  • And we did this!
  • Well, they did not manage to break our system or steal any user data. The only thing they did was insert some JavaScript into the „nameColor“ field, and we did not validate it. We validated everything else, but not this one, because it is only a number...

    so
  • Really, everything, really, really everything!
    Again, we thought we could party!
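
The nameColor incident is a case for allow-list validation: accept only values that match a strict pattern and reject everything else. A minimal sketch follows, written in Python for illustration rather than in the Node.js the talk describes. It assumes the field holds a six-digit hex color; that format and the name `validate_name_color` are hypothetical and not taken from the hitbox code.

```python
import re

# Assumption: a name color is exactly six hex digits, e.g. "D44F3A".
_HEX_COLOR = re.compile(r"[0-9A-Fa-f]{6}")


def validate_name_color(value: object) -> str:
    """Return the color in upper case, or raise ValueError if it is not a hex color."""
    # fullmatch rejects anything with extra characters before or after the digits.
    if not isinstance(value, str) or _HEX_COLOR.fullmatch(value) is None:
        raise ValueError("nameColor must be six hex digits")
    return value.upper()
```

A payload such as `<script>alert(1)</script>` fails the pattern and never reaches other viewers.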
  • But... then others came and did this
  • A WebSocket DDOS! Sending massive amounts of join commands to the chat.
    So we had to think about how to distribute this load better or make it harder for them to reach all frontend servers. Remember, they scale up & down automatically.
  • So this is how we do load balancing on the frontend servers. It works really well.

    If they DDOS a few servers, these servers will not get new connections, and from the upscaling we get new servers that are not being DDOSed.

    Why the random factor in the UI? F5, more on this later.
    So once again we party hard!
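
The load-balancing slide lists these steps: frontend servers report CPU load, the API hands the UI the least-loaded ones, the UI picks one at random, drops a server when it disconnects, and asks for a fresh list when none are left. The sketch below shows one way to express that in Python. The names `least_loaded` and `FrontendPicker` and the in-memory dictionary of loads are assumptions for illustration, not the actual hitbox implementation.

```python
import random
from typing import Callable, Dict, List


def least_loaded(cpu_by_server: Dict[str, float], count: int) -> List[str]:
    """API side: return the `count` servers with the lowest reported CPU load."""
    return sorted(cpu_by_server, key=lambda name: cpu_by_server[name])[:count]


class FrontendPicker:
    """UI side: pick a random candidate, drop it on disconnect, refill when empty."""

    def __init__(self, fetch_candidates: Callable[[], List[str]]) -> None:
        self._fetch = fetch_candidates
        self._candidates: List[str] = []

    def pick(self) -> str:
        if not self._candidates:
            # No servers left: ask the API for a fresh list.
            self._candidates = list(self._fetch())
        if not self._candidates:
            raise RuntimeError("no frontend servers available")
        return random.choice(self._candidates)

    def mark_disconnected(self, server: str) -> None:
        if server in self._candidates:
            self._candidates.remove(server)
```

Choosing randomly among the candidates, instead of always taking the single least-loaded server, keeps a wave of reconnecting clients from all landing on the same machine.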
  • Until he came
  • Rezigiusz, a Polish YouTuber & streamer with a lot of fans who love to type.
    Think of him as the One Direction of Poland.
  • When he is streaming he has around 1-15k viewers and they type 2000 messages a second into the chat!
    1995 of them get blocked, but the backend servers have to check this...

    So the event loop of Node.js exploded...
  • But using async.js, which is a great tool to queue work, we could clean up the event loop, delaying some messages by a few milliseconds but letting the main tasks keep working fine.
  • So for example we made queues for the most important functions: login, logout, chatmsg, etc.

    So, we can party again!
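
The queueing idea can be sketched as a work queue with a fixed number of concurrent workers, so a burst of messages waits in line instead of flooding the event loop. The talk uses async.js on Node.js; the version below swaps in Python's `asyncio.Queue` purely for illustration. The function name `run_queue` is hypothetical, and the sketch assumes the worker handles its own errors.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_queue(
    tasks: List[str],
    worker: Callable[[str], Awaitable[None]],
    concurrency: int,
) -> None:
    """Process tasks from a FIFO queue with at most `concurrency` running at once."""
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    for task in tasks:
        queue.put_nowait(task)

    async def consume() -> None:
        while True:
            task = await queue.get()
            try:
                await worker(task)
            finally:
                queue.task_done()

    consumers = [asyncio.create_task(consume()) for _ in range(concurrency)]
    await queue.join()  # returns once every queued task has been processed
    for consumer in consumers:
        consumer.cancel()  # the idle consumers are no longer needed
    await asyncio.gather(*consumers, return_exceptions=True)
```

A separate queue per task type (login, logout, chat message) would let each one get its own concurrency limit.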
  • But don't forget one of the biggest problems you can run into...
  • I know, this sounds stupid, but I will give you two examples:
    Imagine you have a stream with 100k viewers. Every time a new viewer comes to this stream, he/she gets the info about how to get the stream from our server.
    Now imagine the streamer has a problem: let's say his computer crashes and the stream drops, meaning it goes black or gets stuck.
    What do 100k people do?
  • This.
    And let's hope that your API can handle this!
    And they won't stop until they have a stream again!
  • We learned a lot about caching, otherwise you cannot handle this. Memcache & Redis are your friends here.

    The second example is stupid software design:
    Quite often streamers announce when they will start to stream, and then people are already waiting on the page for them to go online.
    Well, we have the chat already connected anyway, so why not send a special message over the chat to trigger the start of the stream...
    Sounds easy, and for our system it is.
  • Because then again you will DDOS yourself. Imagine this with 100k people waiting...
  • So sometimes realtime is really bad, because it is realtime... And it can destroy you.
  • So we went back to the good old interval, because then you distribute the 100k connections over 30 seconds, giving you much more time to handle the load.

    So, we can party again!
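
The effect of the interval can be illustrated with a small simulation: if each waiting client checks at a random point within a 30-second window, the 100k requests arrive as a few thousand per second instead of all at once. The explicit random offset is an assumption made for this sketch; the notes only say that an interval spreads the connections over 30 seconds. The function names are hypothetical.

```python
import random
from collections import Counter

POLL_INTERVAL_SECONDS = 30  # interval length taken from the notes above


def first_poll_delay(rng: random.Random) -> float:
    """Assumed behaviour: each client waits a random part of the interval first."""
    return rng.random() * POLL_INTERVAL_SECONDS


def requests_per_second(clients: int, rng: random.Random) -> Counter:
    """Count how many clients poll in each one-second slot of the interval."""
    return Counter(int(first_poll_delay(rng)) for _ in range(clients))
```

With 100,000 clients the average is about 3,333 requests in each one-second slot, compared with 100,000 in the same instant when a chat message triggers everyone together.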
  • The same guy as at the beginning; he has his own website with animated GIFs.
  • Well, at the end, something that is very important to me: monitor everything!
  • Our Swiss Army knife is statsd from Etsy, a great piece of software written in Node.js that monitors stuff via UDP and works great.
  • We use it in combination with Graphite and monitor really everything.
    See the down-spike on active chat connections? That is when Node.js is not able to keep the 10-second timing for reporting the stats. You get used to it.
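
Sending a metric to statsd amounts to one small UDP datagram in the form `<name>:<value>|<type>`, where `c` marks a counter and `g` a gauge. The sketch below shows this in Python for illustration. The metric names, the helper names and the default address (statsd's usual port 8125 on localhost) are assumptions, not the hitbox setup.

```python
import socket


def format_metric(name: str, value: int, metric_type: str) -> bytes:
    """Build a statsd datagram: "<name>:<value>|<type>" (c = counter, g = gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP; the sender does not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical usage: report the current number of chat connections as a gauge.
# send_metric(format_metric("chat.connections", 154000, "g"))
```

Because UDP is fire-and-forget, reporting stats adds very little work to a busy server.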
  • Well, and at the end, is the chat system working? Does it scale?
  • Well, I don't have a screenshot of our latest record, which was close to 200k, but this one shows you a channel with 100k people.
    All 154k connections were handled by 16 frontend servers and 8 backend servers, costing us around $20 for the evening.
  • And don't forget the network traffic!
    Around 160-200 Mbit/s per machine, outgoing text only! These cheap machines are limited to around 200 Mbit/s.
    That's it, thank you!
  • Just one more thing:
  • We are hiring!
  • How to develop innovative, scalable systems

    1. How To Develop Innovative, Scalable Systems?
    2. Chat System
    3. How hard can it be?
    4. • Authentication • Slowmode • Moderator • Admins • Subscribers • Timeout • Ban • Limits per user • Notify messages • Imagelog • Unlimited Rooms Not that easy! • IP-Ban • Raffle • Voting • Whisper • Blacklist • Block user • Posting Images • DDOS Protection • Limits per channel • Chatlog • System messages
    5. Realtime
    6. No IRC!
    7. WebSocket Web Server UI Let’s try this PHP Mysql Auto scaling
    8. Less than 2k
    9. Back to the drawing Board!
    10. WebSocket Load balancing Permission and security model (Admin, Mods, ...) Frontend Server Backend Server UI Ok, so let’s try this! Frontend Server NodeJs data storage Redis Cluster hitbox REST-API PHP Nginx Backend Server NodeJs Auto scaling Auto scaling Average roundtrip / message: < 300ms
    11. • Small, cheap machines • Handle the connections, no logic • When it breaks it breaks only for a few users • Automatic Failover to another chat frontend server • Socket.io for handling websockets • Carrier for sending messages between front & back • Up & Downscale possible as needed Frontend Server
    12. • Small, cheap machines • Handles all the logic • Stateless, can be restarted/upgraded any time • Easily expandable with new features • Up & Downscale possible as needed • Load balancing via round robin Backend Server
    13. • Fast • I mean, REALLY fast! • You can cluster it • Easy to back up Redis
    14. No single point of failure
    15. Websockets...
    16. WebSocket Load balancing Permission and security model (Admin, Mods, ...) Frontend Server Backend Server UI Ok, let’s fix Websockets Frontend Server NodeJs data storage Redis Cluster hitbox REST-API PHP Nginx Backend Server NodeJs Auto scaling Auto scaling Long Polling Fallback Fallback Server NodeJs
    17. Script Kiddies
    18. Validate everything!
    19. • Frontend servers report CPU load every 10 Seconds • Lowest X frontend servers are sent to the UI • UI selects a frontend server randomly from these five • If UI gets disconnected it removes server from list • UI tries another frontend server • If no servers left UI gets X new frontend servers from API Load Balancing
    20. 2000 Messages/seconds
    21. Async.js for the win!
    22. „Self“ DDOS
    23. Cache early, Cache often!
    24. Stupid Software Design
    25. Sometimes Realtime is bad!
    26. Monitor Everything!
    27. Etsy‘s statsd
    28. So, is it Working?
    29. Thank you! max@hitbox.tv
    30. We are hiring! jobs.hitbox.tv
