House - Dynamic Bandwidth Throttling in a Client Server ...
Slide notes
  • Gamer needs to be a server. We need a lot of bandwidth. Early during Shadowrun, design wanted a guarantee that they could do 16 players, and really wanted to do more. We were not able to guarantee it. Design was not done; not sure what would require updating and how often. The reality is that we always need to be smart about how we update.
  • Home environment is hostile Machines can move
  • How do we adjust bandwidth? How do we adjust game state replication? When/why do we migrate? How we ensure hosts / how do we pick?
  • Connection Control Global Control History Control
  • Goal set by Global Control UDP Padding
  • Packet Loss Subsequent Acks Timeouts Counters to filter out spurious
  • Global Control Divide bandwidth among connections Grow till exceed and then backoff Ability to look at behavior across connections
  • More than 50% in recovery. A single bad connection does not cause the signal. It could prohibit growth, but we account for that.
  • Even distribution for the most part Some need extra due to voice Bandwidth might be limited for bad connections
  • Bad connection: counter accumulates every period with congestion, decrements every period without. Crossing the threshold means it's bad – in the doghouse. Can recover. Bandwidth limited. No longer restricts growth.
  • History Control Continuity Over Time Grow while able to deliver Back down if fail Measurements occur during periods
  • Period starts when goal is set. Period canceled if goal is changed. Ends when goal is met for some time. Success if no recoveries – recorded. Unknown if some recoveries – not recorded. Failure if too many recoveries – recorded.
  • All success and failures recorded Circular buffer Holds hours of play From history can calculate reliable percentage / failure percentage
  • Reliable Bandwidth == ReliablePercentage >= 95%
  • Reliable bandwidth used as goal No history get conservative value Step up bandwidth will be tried
  • FailurePercentage less than or equal to 5%. Consecutive successes after the first step up will continue to step up. Failure will cause it to fall back to the lower bandwidth.
  • After failure, many successes needed to try again 1 in 20 ratio picked – 2 hours of play
  • Good connection, Open Nat, Good Ping Default used Success Recorded Step up tried Success Recorded
  • Step up again will continue Till failure Connections will hit RTT thresholds Latency spike across all connections Some lag in game will occur All connections in recovery
  • Global control will notice majority recoveries causing recovery This will happen multiple times This will lead to failure Another step up will not occur for 20 periods
  • A hiccup will be filtered at either the connection or global level. A sustained problem will be recorded. If it continues, it will trash the history. Player no longer reliable and will essentially be removed from the pool.
  • Reliable bandwidth primary consideration Best reflects player/box ability to cope Must be able to meet game needs If best can’t be found, considers other possibilities in bands
  • Pool built from criteria Tie breakers applied Ping, Nat Type, Host reliability
  • Importance of type Time of last Client importance
  • Updates sent based on priority Time intensive task Both dev wise and runtime Requires iteration / tweaking / gameplay Critical for polish
  • Host migration Big debate for Shadowrun Halo 2 approach appears to encourage bad behavior Went with host initiated migration only – removing ability / incentive
  • Good hosts searching for games put themselves into games without good hosts.
  • Home sucks History identifies good hosts Design for consistency instead of rapid global adjustment Consistent good hosts picked / rewarded
  • Quick to respond to changing conditions. Bad hosts eliminated relatively quickly. Games likely to disband anyway.
  • Beta – good results Good hosts found time and time again when reset Bad hosts culled quickly
  • Transcript

    • 1.  
    • 2. Dynamic Bandwidth Throttling Bart House, Development Lead, Microsoft
    • 3. The Problem
      • Client / Server games require server
      • Server uses high outbound bandwidth
        • Bandwidth to update N clients at 30Hz
          • Bandwidth = N * N * per-client-update * 30
          • Bandwidth for 15 clients with just a 5-byte (40-bit) update: Bandwidth = 15 * 15 * 40 * 30 = 270kbps
        • Average home connection <300kbps
        • We never have enough bandwidth
        • Cannot update ideally for large games
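The arithmetic above can be checked with a quick sketch; the helper name is mine, and it assumes the 5-byte per-client update is counted as 40 bits:

```python
def server_bandwidth_bps(clients, update_bits, rate_hz):
    """Outbound server bandwidth: every client receives updates
    about every client, at the given tick rate."""
    return clients * clients * update_bits * rate_hz

# 15 clients, 5-byte (40-bit) per-client update, 30 Hz
print(server_bandwidth_bps(15, 40, 30))  # 270000 bits/s = 270 kbps
```

Against a sub-300kbps home uplink, the server is saturated by a single 15-player game before any other traffic is considered.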
    • 4. The Problem (cont.)
      • Home machine is in a hostile environment
        • Many devices compete for bandwidth
          • Voice over IP can quickly saturate bandwidth
          • High bandwidth browsing and downloading
            • WoW patches, P2P downloads, etc…
          • And the problem will only get worse
            • MP3 players downloading in the background wirelessly
        • Many Xboxes are connected wirelessly
      • Home machine can move
        • Team LAN parties
    • 5. Questions we will answer
      • How do we adjust bandwidth utilization to match bandwidth availability?
        • This is the bulk of the talk
      • As available bandwidth changes, how do we adjust game state replication?
      • What do we do when a machine can no longer host?
      • When matchmaking, how do we ensure someone can host the game?
    • 6. Adjusting Bandwidth
      • Adjusting is done at three different levels
        • Connection Control
          • Server to client connection
          • Takes care of rapid adjustments due to client specific problems
        • Global Control
          • Server adjusts all client connections simultaneously
          • Adjustments for problems across multiple clients
            • Problems most likely due to local bottleneck
        • History Control
          • Server adjust overall bandwidth target as conditions change and evidence builds
          • Provides continuity during a game
          • Allows for growth and adjustment over course of multiple games
          • Provides basis for estimating future performance
    • 7. Connection Control
      • Bandwidth between server and client
      • Try to reach and maintain goal bandwidth
        • Goal bandwidth is set by Global Control
      • All traffic is UDP based
        • Reliable messaging built on top
        • Not part of talk today
      • Bandwidth is always used
        • Packets are handed up to game to fill
        • Unused packet space is padded
        • Ensures bandwidth is available when needed
    • 8. Congestion Control
      • When congestion is detected
        • Reduce current bandwidth
        • When congestion has cleared, increase current bandwidth over time until goal is reached
        • Maintain bandwidth at goal
      • Two signals for congestion control
        • Increase in measured RTT
        • Packet loss due to timeout or subsequent acknowledgements
    • 9. Congestion Control States
      • Maintain State
        • Remain in this state if goal is reached
        • Transition to Growth if able to maintain current bandwidth for some period of time
      • Recovery State
        • Entered when congestion is detected
        • Transition to Maintain when able to achieve current bandwidth and when RTT has stabilized
    • 10. Congestion Control States
      • Growth State
        • Growth rate established when target was set
          • Allows for rapidly growing connections when appropriate
        • Growth is stopped when congestion is detected
          • Recovery state is entered
        • Growth is stopped if measured throughput fails to come close to current bandwidth
          • Maintain state is entered
        • Growth occurs in steps
          • Next step taken when measured throughput comes close enough to current bandwidth
        • RTT threshold used for congestion warning signal
          • Established when state is entered
          • Adjusted as packet sizes increase
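The Maintain / Recovery / Growth cycle on slides 9 and 10 can be sketched as a small state machine. The class, step size, halving on congestion, and 90% "close enough" throughput check are illustrative choices, not values from the talk:

```python
class ConnectionControl:
    """Sketch of per-connection congestion states: Maintain, Recovery, Growth."""

    def __init__(self, goal_bps, step_bps=8000):
        self.goal = goal_bps
        self.current = goal_bps // 4   # start well below goal
        self.step = step_bps
        self.state = "maintain"

    def on_tick(self, congested, measured_bps):
        if congested:
            # congestion signal: cut bandwidth and enter recovery
            self.state = "recovery"
            self.current = max(self.current // 2, self.step)
        elif self.state == "recovery":
            # throughput delivered again: back to maintain
            if measured_bps >= self.current:
                self.state = "maintain"
        elif self.state == "maintain":
            # held current bandwidth without congestion: try to grow
            if self.current < self.goal:
                self.state = "growth"
        elif self.state == "growth":
            # next step only when throughput comes close to current
            if measured_bps >= 0.9 * self.current:
                self.current = min(self.current + self.step, self.goal)
            else:
                self.state = "maintain"
            if self.current >= self.goal:
                self.state = "maintain"
```

A real implementation would gate the maintain-to-growth transition on elapsed time and adjust the RTT threshold as packet sizes grow, as the slides describe.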
    • 11. RTT
      • Primary congestion control signal
      • To calculate RTT
        • Timestamp in every packet
          • Allowed us to see changes in RTT more quickly
        • Used low pass filter to calculate smooth RTT
      • Baseline RTT established in Maintain state
      • Significant deviation from baseline used as a signal of congestion
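A minimal sketch of this RTT signal: an exponentially weighted low-pass filter over per-packet RTT samples, with congestion flagged when the smoothed value deviates too far from a baseline captured in the Maintain state. The 0.125 gain and 1.5x deviation threshold are illustrative values, not from the talk:

```python
class RttMonitor:
    """Low-pass-filtered RTT with baseline-deviation congestion signal."""

    def __init__(self, gain=0.125, threshold=1.5):
        self.gain = gain            # filter weight for new samples
        self.threshold = threshold  # deviation multiple that flags congestion
        self.smoothed = None
        self.baseline = None

    def sample(self, rtt_ms):
        # timestamp in every packet gives one sample per ack
        if self.smoothed is None:
            self.smoothed = rtt_ms
        else:
            self.smoothed += self.gain * (rtt_ms - self.smoothed)
        return self.smoothed

    def set_baseline(self):
        """Capture the baseline while in the Maintain state."""
        self.baseline = self.smoothed

    def congested(self):
        return (self.baseline is not None
                and self.smoothed > self.threshold * self.baseline)
```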
    • 12. Packet-loss
      • Two types of packet-loss
        • Loss due to subsequent acknowledgement
          • If we get multiple subsequent acknowledgements we take this as an indication of a packet loss
        • Loss due to timeout
          • Packet failed to be acknowledged after some period of time
          • Timeout calculated from filtered RTT
      • Only causes a congestion control signal if multiple events encountered over an interval of time
        • Spurious packet loss does not trigger control signal
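The spurious-loss filter can be sketched as a sliding window: individual loss events (duplicate acks or timeouts) only raise a congestion signal when several occur within one interval. The window and event-count values are illustrative:

```python
from collections import deque

class LossFilter:
    """Raise a congestion signal only for clustered loss events."""

    def __init__(self, window_ms=1000, min_events=3):
        self.window_ms = window_ms
        self.min_events = min_events
        self.events = deque()  # timestamps of recent loss events

    def on_loss(self, now_ms):
        self.events.append(now_ms)

    def congestion_signal(self, now_ms):
        # drop events older than the window, then test the count
        while self.events and now_ms - self.events[0] > self.window_ms:
            self.events.popleft()
        return len(self.events) >= self.min_events
```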
    • 13. Global Control
      • The server keeps a current goal bandwidth which is divided among all client connections
      • Attempt to reach the goal by growing bandwidth, backing off when bandwidth is exceeded
      • Ability to look at behavior across multiple connections allows detection of bad connections
    • 14. Global Control States
      • Recovery State
        • Entered when bandwidth is exceeded
        • Slow Adjustment is entered when fully recovered
      • Rapid Growth State
        • Quick adjustments are made until bandwidth is exceeded or goal is encountered
      • Slow Adjustment State
        • Small incremental growth until bandwidth is exceeded or goal is reached
      • Goal Reached State
        • Goal bandwidth is maintained until bandwidth is exceeded
    • 15. Global Recovery State
      • Continue to reduce current global bandwidth as long as bandwidth is being exceeded
      • Reduction occurs at regular interval
      • Percentage reduction applied until some minimum
      • Current bandwidth is reduced immediately when state is entered
        • Amount of reduction is less when entered from rapid growth
    • 16. Global Slow/Rapid Growth
      • Grow bandwidth in steps
      • Each step is a small/large percentage of current global bandwidth
      • Next step taken when global bandwidth is measured and maintained over a period of time
      • Growth continues until bandwidth is exceeded or goal is reached
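The slow/rapid growth step can be sketched as a percentage of the current global bandwidth, capped at the goal. The 5% and 25% step fractions are illustrative, not figures from the talk:

```python
def next_global_bandwidth(current_bps, goal_bps, rapid):
    """One growth step: large fraction in Rapid Growth, small in Slow Adjustment."""
    step_fraction = 0.25 if rapid else 0.05
    return min(current_bps + current_bps * step_fraction, goal_bps)
```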
    • 17. Detecting Bandwidth Overuse
      • If over 50% of the connections are in their recovery control state, we assume that the bandwidth is exceeded
      • Single bad connection can affect global control
        • Will not cause bandwidth exceeded signal
        • But can prohibit growth of bandwidth due to failure to deliver its share of throughput
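The majority test above is simple to state in code: the global "bandwidth exceeded" signal fires only when more than half of the connections are in recovery, so a single bad client cannot trigger a global backoff on its own:

```python
def bandwidth_exceeded(connection_states):
    """True when a majority of connections are in their recovery state."""
    in_recovery = sum(1 for s in connection_states if s == "recovery")
    return in_recovery * 2 > len(connection_states)
```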
    • 18. Dividing Bandwidth
      • Even bandwidth distribution among client connections
      • Extra bandwidth given based on need
        • Some clients act as voice repeaters and thus require additional bandwidth
      • Bandwidth might be limited for bad connections
    • 19. Bad Connections
      • Any connection that accumulates congestion signals over time is eventually marked as bad
        • We keep a counter that accumulates for every period that experienced a congestion period and decrements for every period that did not
      • Bandwidth to that connection is limited
      • Its throughput is no longer taken into consideration when determining whether goal bandwidth is met
        • All traffic sent is added to total throughput since we can’t rely on acknowledgements from connection
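The accumulate/decrement counter described above behaves like a leaky bucket; a sketch, with an illustrative threshold (the talk does not give the value):

```python
class BadConnectionDetector:
    """Mark a connection bad after sustained congestion; allow recovery."""

    THRESHOLD = 5  # illustrative; periods of net congestion before "doghouse"

    def __init__(self):
        self.counter = 0

    def end_period(self, had_congestion):
        if had_congestion:
            self.counter += 1
        else:
            self.counter = max(self.counter - 1, 0)

    def is_bad(self):
        return self.counter >= self.THRESHOLD
```

Because the counter decrements during clean periods, a connection can climb back out of the doghouse, matching the "can recover" note in the slides.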
    • 20. History Control
      • Global bandwidth is adjusted over time
      • Global goal bandwidth
        • Adjusted up linearly
          • If the host is able to consistently maintain it
        • Adjusted down exponentially
          • If the host fails to maintain it
      • Bandwidth History Period
        • Global goal held constant
        • Periods used as quantum of measurement
    • 21. Bandwidth History Periods
      • Starts when a global goal bandwidth is set
      • Period ends when:
        • Goal changes and the period is canceled
          • For instance due to a client leaving
        • Goal reached and held for a period of time
          • No global control recoveries occurred
            • Period is considered successful
            • Successful result recorded with goal bandwidth
          • Some global control recoveries occurred
            • Period is neither successful nor failed
            • Nothing is recorded
        • Failure occurs
          • Goal not reached given sufficient time
          • Multiple global control recoveries occurred
    • 22. Bandwidth History
      • Successful/failed periods are recorded
      • Recorded in circular buffer
        • Large enough to hold many hours of play
      • Stored using per-user live storage
        • History is tied both to box and the player
      • From the history, we can calculate:
        • Reliability percentage for some bandwidth
        • Failure percentage given some bandwidth
    • 23. Reliability Percentage
      • Given some bandwidth X, how reliable has this machine been at delivering that bandwidth?
        • Success(X) / (Success(X) + Failure(X))*100
          • Success(X) is the number of successful periods at or above bandwidth X
          • Where Failure(X) is the number of failed periods at or below bandwidth X
    • 24. Failure Percentage
      • Given some bandwidth X, how often have we failed to deliver that bandwidth
        • Failure(X) / (TotalSuccess + Failure(X))*100
          • Where TotalSuccess is the total number of successful periods regardless of bandwidth
    • 25. Reliable Bandwidth
      • Greatest bandwidth X such that ReliabilityPercentage(X) >= 95%
      • Stated another way
        • What bandwidth should we pick to ensure that we will only get a 1 in 20 chance of having a failure to maintain that bandwidth
      • We want to ensure overall consistency in game play
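The history calculations from slides 23 through 25 can be sketched directly from the definitions. Each recorded period is modeled as a `(bandwidth, succeeded)` pair; the function names are mine:

```python
def reliability_percentage(history, x):
    """Success(X) / (Success(X) + Failure(X)) * 100, where Success(X) counts
    successes at or above X and Failure(X) counts failures at or below X."""
    success = sum(1 for bw, ok in history if ok and bw >= x)
    failure = sum(1 for bw, ok in history if not ok and bw <= x)
    total = success + failure
    return 100.0 * success / total if total else 0.0

def failure_percentage(history, x):
    """Failure(X) / (TotalSuccess + Failure(X)) * 100, where TotalSuccess
    counts all successful periods regardless of bandwidth."""
    total_success = sum(1 for _, ok in history if ok)
    failure = sum(1 for bw, ok in history if not ok and bw <= x)
    denom = total_success + failure
    return 100.0 * failure / denom if denom else 0.0

def reliable_bandwidth(history):
    """Greatest recorded bandwidth X with ReliabilityPercentage(X) >= 95%."""
    best = 0
    for x in sorted({bw for bw, ok in history if ok}):
        if reliability_percentage(history, x) >= 95.0:
            best = max(best, x)
    return best
```

Note how the 5% failure gate on slide 27 matches the 1-in-20 ratio: one recorded failure at a stepped-up bandwidth needs roughly 19 subsequent successes before that bandwidth is tried again.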
    • 26. Use of Reliable Bandwidth
      • Almost always used as global goal
      • Two exceptions
        • When no history is present
          • Conservative estimate is used
          • Enough to be considered when picking host
          • But not too much
          • Below average home connection speed
          • Reduce chance of poor game play at game start
    • 27. Trying Higher Bandwidth
      • Only consider one step higher
        • A bandwidth that is one step higher than our reliable bandwidth
      • Will use this higher bandwidth if:
        • FailurePercentage(X) <= 5%
      • Success or failure will be recorded
      • Logic runs again
    • 28. Trying Higher Bandwidths
      • Consider this case
        • Recent failure at 320kbps (stepped up)
        • Connection speed at 300kbps
        • ReliableBandwidth at 300kbps
        • FailurePercentage(320) > 5%
      • What happens
        • 300kbps will be used repeatedly
        • Each success slowly reduces FailurePercentage(320)
        • Eventually, FailurePercentage(320) <= 5%
        • 320kbps will be tried
        • Failure will be recorded
        • This will repeat, trying 320 once every 2 hours
    • 29. Bandwidth Control In Action
      • Let's consider some typical scenarios
        • No history but good connection
        • No history but bad bandwidth
        • Good history but temporary problem
    • 30. No History Good Connection
      • A client with no history will default to assuming a reasonable reliable bandwidth
        • Assumption is enough to host a moderate size game
        • If host has good pings to other clients and open NAT, it is likely they will be picked to serve
      • Global Goal Bandwidth will use default
        • Default is conservative
        • Below average home connection
        • Goal will be achievable > 50% of the time
        • Bandwidth period will be recorded as a success
        • Next bandwidth step will immediately be tried
          • FailurePercentage(x) is always 0% until a failure
    • 31. No History Good Connection
      • Cont.
        • Higher bandwidth will likely succeed
        • Another success recorded
        • Higher and higher bandwidths will be tried
        • Eventually bandwidth will approach actual
        • Latency spike on all connections will occur
        • RTT congestion signal will trigger
        • Connection control recovery state entered across all connections
    • 32. No History Good Connection
      • Cont.
        • Majority congestion problems
          • Global recovery state will be entered
        • After recovery
          • Slow growth will then be followed by another recovery
        • Multiple recoveries will cause period failure
        • Reported failure will stop increases
        • Another step up will not occur for a while
        • Bandwidth usage stabilizes just below actual bandwidth
    • 33. No History Bad Bandwidth
      • Initial guess is too high
      • Congestion will be seen immediately
      • First period will be reported as a failure
      • Step down bandwidth is tried next
      • If this fails, step downs are increased exponentially
      • Eventually bandwidth will be set below actual bandwidth
      • Bandwidth will stabilize here
    • 34. Good History Hiccups Occur
      • Bandwidth is stable below actual bandwidth
      • Something happens
        • Available bandwidth is lowered
      • If it occurs very briefly
        • Connections will briefly experience congestion
        • Bandwidth across all connections will be dropped
        • Single global recovery will occur
        • Period will not fail
      • If it occurs over a sustained period of time
        • Failure will be recorded
        • Reduced bandwidth used next
        • Reductions will continue
        • ReliableBandwidth is significantly reduced
        • This is what we want
    • 35. Picking Best Host
      • Reliable bandwidth is primary factor
      • Reflects the ability of that box, under that player's control, to reliably deliver bandwidth (at least 95% of the time)
      • Basing the estimate on a long history captures the ability of the player and box to survive in the hostile home environment
    • 36. Picking Best Host (cont.)
      • We build a pool of hosts from those that have a reliable bandwidth large enough for current game size
      • From this host pool we use other criteria as tie breakers
        • Ping times to other players
        • Nat type (open preferred over moderate)
        • Percentage of games left gracefully
          • Graceful exits are those that the game has a chance to remove the player from the game before the box is turned off or the network cable is unplugged
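The host-picking procedure on slides 35 and 36 can be sketched as a filter plus a lexicographic tie-break. The candidate field names and NAT ranking are illustrative:

```python
# prefer open NAT over moderate (lower rank wins)
NAT_RANK = {"open": 0, "moderate": 1, "strict": 2}

def pick_host(candidates, required_bps):
    """Pool = candidates whose reliable bandwidth covers the game size;
    tie-break by ping, then NAT type, then graceful-exit percentage."""
    pool = [c for c in candidates if c["reliable_bps"] >= required_bps]
    if not pool:
        return None
    return min(pool, key=lambda c: (c["avg_ping_ms"],
                                    NAT_RANK[c["nat"]],
                                    -c["graceful_exit_pct"]))
```

The slides also mention falling back to bandwidth "bands" when no ideal host exists; that refinement is omitted here.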
    • 37. Game State Replication
      • Game state
        • Player positions
        • Weapon damage
        • Object positions in world
        • Weapon in hand
        • Player health
        • Object damage
      • As bandwidth between the server and a client changes, the game must react appropriately to the changing conditions
    • 38. Priority Based Replication
      • Updates are assigned a priority based on:
        • Importance of update
          • Player position more important than player health
        • Time of last update
          • The longer since last update, the higher the priority
        • Client importance
          • Can the players on that client see/hear what is being updated
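A sketch of the three-factor priority: base importance, plus staleness, scaled down for updates the receiving client cannot see or hear. The weights are illustrative; as the next slide says, the real scheme was hand-tuned from gameplay traces:

```python
def update_priority(importance, ms_since_last_update, relevant_to_client):
    """Combine update importance, time since last send, and client relevance."""
    priority = importance + ms_since_last_update / 100.0
    if not relevant_to_client:
        priority *= 0.25  # out-of-view objects still age, just more slowly
    return priority

def order_updates(updates):
    """Send order: highest priority first until the packet is full."""
    return sorted(updates, key=lambda u: -update_priority(*u))
```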
    • 39. Priority Based Replication
      • Updates are then sent based on priority
      • Time intensive task to get priority scheme right
        • Must play, take traces and decide whether the right decisions are being made by priority system when under load
        • This hand tuning is critical to get the best polish for the game
    • 40. Host Migration
      • Host migration is moving the hosting responsibilities from one box to another box
      • Shadowrun only supports host initiated migration
        • Eliminates the potential for exploitation
        • Halo2’s host election process had unintended consequences
          • Encouraged griefing
    • 41. Host Migration
      • Host is migrated when:
        • Host chooses to leave
          • Game will end, hosting is migrated and host is then removed from game
        • Current host is no longer best
          • Host changed between games
          • Consistent gameplay for duration of game
          • Between rounds would perhaps have been better
        • Game is prematurely stopped
          • Host bandwidth no longer supports game size
    • 42. Matchmaking
      • Good hosts
        • Good bandwidth
        • Open Nat
        • High Hosting Reliability
      • Good hosts will favor games without
        • Will try to join a game that needs good host before trying to join one that does not
        • But will only do so if game is a good match
    • 43. Putting It All Together
      • Home is a hard place to serve from
      • History attempts to identify players who can manage it well
      • Design focuses on consistency
        • Global bandwidth increases are made over multiple rounds of play
      • Good hosts are rewarded by having host advantage and thus encouraging players to be good hosts
    • 44. Putting It All Together (cont.)
      • Quick to respond to changing conditions
        • Low level point-to-point control ensures continued connectivity
        • This response happens very quickly
      • If problems persist, global bandwidth control will kick in and reduce overall targets within minutes
        • Host will eventually be replaced
        • Bad host will have low reliable bandwidth
    • 45. Putting It All Together
      • During Beta bandwidth histories accurately reflected player connection abilities
      • System repeatedly found the same good hosts as system histories were reset
      • Bad hosts did cause bad rounds of play but were quickly eliminated from the pool of hosts in future games
    • 46. Wish List
      • Ability for game to know when box has moved
        • Create a signature that can be stored with the history that represents the network location
        • For instance using the MAC of the local gateway along with perhaps routing information to known service
      • Ability to manage QoS in the home
        • Demands for bandwidth in the home are only going to get worse
        • Efforts need to be made to help manage bandwidth across devices in the home
    • 47. Wish List (cont.)
        • Consistent Bandwidth Control and Prediction across titles and platforms
          • Game developers should be relieved of this job
          • Common problem for many games
          • History is applicable across titles and platforms
        • Power Off UDP Packet Delivery
          • Add ability for hardware to send out notification of powering down before actually powering down
          • This will allow others who are connected to game to be notified of removal of box and thus can handle it gracefully