Ensuring Your Technology Will
Scale
Niniane Wang
Basis Set Ventures
October 2019
Speaker Background
VP of Engineering at Niantic (acquired my startup)
Board Member at Serena & Lily
Advisor to Basis Set Ventures
Founder / CEO of startup Evertoon
CTO of Minted
Led Gmail Ads eng, cofounded Google Desktop (75M active users)
Eng manager on Microsoft Flight Simulator
Graduated from Caltech in computer science at age 18
Technology Stacks
Single shared game world, with geographically indexed database tables:
● Java
● Datastore, Spanner
● Running on Google Cloud Platform

Commerce platform with algorithmically predicted crowdsourced designs:
● Python
● Flask microframework
● MySQL
● Dedicated Rackspace (at the time)

Various services:
● Java
● Borg
● Internal Google Cloud Platform
Traffic Spikes
Each service I’ve worked on has
experienced spikes.
Niantic:
● Launch of Pokémon GO and Harry Potter: Wizards Unite
Minted:
● Television appearances
Google:
● Launch of any Google product
Pokémon GO launch
Agenda
● Optimize your loadtesting
● How to handle the unexpected
● Tips for working with third-party dependencies
● The tough decision of when to re-architect
Optimizing Loadtesting
Loadtesting: Review of Methodology
● Set up sequences of API calls based on common user journeys
○ Simulate a user doing the core experience of your product
○ Simulate a situation likely to cause contention, similar to Google Docs with 100 simultaneous editors
● Use a tool such as Apache Bench to simulate simultaneous users
● We spun up hundreds of server instances to simulate users
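The journey-based approach can be sketched in a few lines of Python. The endpoint names and `call_api` below are hypothetical stand-ins; a real harness would use an HTTP client or a tool like Apache Bench:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_api(endpoint):
    """Hypothetical stand-in for a real HTTP call (e.g. via an HTTP client)."""
    time.sleep(0.001)  # simulate a little network latency
    return 200

def user_journey():
    """One scripted sequence of API calls: the core experience of the product."""
    statuses = [call_api(e) for e in ("/login", "/feed", "/action")]
    return all(s == 200 for s in statuses)

def run_load_test(simulated_users=50):
    """Run many journeys concurrently and return the fraction that succeeded."""
    with ThreadPoolExecutor(max_workers=simulated_users) as pool:
        results = list(pool.map(lambda _: user_journey(), range(simulated_users)))
    return sum(results) / len(results)
```

In practice you would scale `simulated_users` up across many machines, as the slide describes, and track latency percentiles in addition to the success rate.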
When Traditional Loadtesting Works Best
Pros:
● Can re-run any time at your convenience.
● This means you can make fixes and then run the loadtest again.
● Well-suited to the early stages of scalability work, when there are many bottlenecks to uncover.
Cons:
● Will inevitably have some discrepancy from actual user API calling patterns.
● Won't simulate the variety of user locations, cache hits / misses, or different user devices.
Using Real Users to Expose Bottlenecks
After you’ve gotten past the initial bottlenecks, it’s time to more closely simulate real user
traffic.
In your beta, you can reduce your server resources to expose bottlenecks.
● For example, let’s say you expect to serve 1M users with 200 servers.
● Open your beta to 20,000 users, and use 4 servers.
● Hold an event that encourages simultaneous use.
● Or take one server out of rotation, and look at which resource gets
bottlenecked first.
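The example numbers above keep the load per server constant. A quick sanity check (assuming load scales roughly linearly with users):

```python
def users_per_server(users, servers):
    return users / servers

# Expected production load vs. the deliberately under-provisioned beta:
production = users_per_server(1_000_000, 200)  # 5000.0 users per server
beta = users_per_server(20_000, 4)             # 5000.0 users per server
assert production == beta  # the beta stresses each server like launch day would
```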
Traffic Shadowing
You can send real user traffic to an unlaunched service, using asynchronous calls.
This mimics real traffic patterns:
● Which edge servers are hit
● CDN and other caches
If the unlaunched service goes down, make sure that real user traffic won’t get affected.
function doOperationA () {
…
start asynchronous thread to do operation B
…
return value for operation A
}
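A sketch of the same pattern in Python (the operation names are hypothetical): the shadow call runs on a daemon thread and swallows its own exceptions, so an outage in the unlaunched service cannot affect the real response:

```python
import threading

def operation_b(request):
    """Hypothetical call into the unlaunched service."""
    pass

def shadow_operation_b(request):
    try:
        operation_b(request)
    except Exception:
        pass  # shadow failures must never propagate to real traffic

def do_operation_a(request):
    # Fire-and-forget: the real request never waits on the shadow call
    threading.Thread(target=shadow_operation_b, args=(request,), daemon=True).start()
    return {"status": 200}  # real response computed as usual
```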
Handling the Unexpected
Give Real-time Levers to Your Future Self
Most bottlenecks you encounter will be unexpected. (If they were expected, you would have fixed them pre-launch.)
Q: How can you react quickly to unexpected problems?
A: Give tools to your future self:
1. Charts to visualize server metrics
2. Real-time levers to reduce resource contention
Example 1: Directing Traffic to Servers
In one situation, we needed to distribute traffic across servers. Ways of doing so:
● Round robin
● Algorithm to keep similar traffic on the same server, to reduce inter-server calls
We created a switch that can move between methods. If the algorithm was too uneven,
we could switch to round robin.
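A minimal sketch of such a switch in Python (the server names and config mechanism are hypothetical); in production the mode would come from a real-time config store rather than a module variable:

```python
import itertools

SERVERS = ["server-1", "server-2", "server-3"]
ROUTING_MODE = "affinity"  # the real-time lever: flip to "round_robin" under load
_round_robin = itertools.cycle(SERVERS)

def route(traffic_key):
    """Pick a server: keep similar traffic together, or spread it evenly."""
    if ROUTING_MODE == "affinity":
        # Same key -> same server, which reduces inter-server calls
        return SERVERS[hash(traffic_key) % len(SERVERS)]
    return next(_round_robin)
```

In affinity mode, `route("cell-42")` returns the same server every time within a process; flipping `ROUTING_MODE` to `"round_robin"` evens out a hotspot at the cost of more inter-server traffic.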
Example 2: Reducing Database Contention
For database contention, a certain high-traffic startup used a trick to count views once a page started getting heavy traffic:
● 25% of the time, they added 3 views
● 25% of the time, they added 1 view
● 50% of the time, they didn’t add any views
This isn’t precise, but it gave them the option of cutting database calls in half during
high-traffic times.
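A sketch of that probabilistic counter in Python. The expected increment per page view is 0.25 × 3 + 0.25 × 1 = 1 view, so counts stay accurate on average while only ~50% of views touch the database:

```python
import random

def record_view(rng):
    """Return (db_write_needed, views_to_add) for one page view."""
    r = rng.random()
    if r < 0.25:
        return True, 3   # 25% of the time: write +3 views
    if r < 0.50:
        return True, 1   # 25% of the time: write +1 view
    return False, 0      # 50% of the time: skip the database entirely

# Simulate 100,000 page views with a fixed seed:
rng = random.Random(0)
outcomes = [record_view(rng) for _ in range(100_000)]
write_rate = sum(w for w, _ in outcomes) / len(outcomes)      # ~0.5
avg_increment = sum(v for _, v in outcomes) / len(outcomes)   # ~1.0
```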
Example 3: Config Value for API Call
In one situation, each user had a token that needed to be refreshed every X minutes.
We made X a real-time configurable value.
When there was a bottleneck on refreshing tokens, X could be increased in real time.
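A sketch in Python (the config names are hypothetical): because the interval is read on every check rather than baked in, operations can raise it live during a bottleneck:

```python
CONFIG = {"token_refresh_minutes": 15}  # backed by a real-time config store

def refresh_due(minutes_since_last_refresh):
    """Re-read the interval on every check so a live config change takes effect."""
    return minutes_since_last_refresh >= CONFIG["token_refresh_minutes"]

assert refresh_due(20)                # due at the default 15-minute interval
CONFIG["token_refresh_minutes"] = 60  # lever pulled during a bottleneck
assert not refresh_due(20)            # refresh pressure drops immediately
```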
Types of Levers to Consider
● Feature flags to temporarily disable non-critical features
● Config values that control the frequency of API calls, token
refreshes, periodic jobs
● Ability to add more servers to a pool
● Reduce API calls from clients (e.g. mobile apps) in the wild
○ E.g. Retry loops
Resources that Could Become Bottlenecks
Here are the most common resource contentions. What levers can you add?
● Database contention (e.g. volume of queries, hotspots)
● Worker pools and worker queues
● Disk space
● Memory
● CPU
● Third-party partner/vendor dependencies
Practice in Advance
Practice these and document the steps & pitfalls:
● Restore data from backup
● Reboot one specific server
● Add a server into the rotation
● Failover to another zone
● … other custom emergencies based on your architecture ...
Third-Party Dependencies
Third-Party Dependencies
You will always have some reliance on third-party vendors:
● Login
● Analytics
● Social services
● Marketing services
● Customer service, e.g. live chat
● ...
Vendor Reassurances
Vendors will reassure you:
“We have bigger customers than you. We can handle 50 times your traffic level.”
“Black Friday is make-or-break for us, and we’ve gone all-out to prepare.”
“If we couldn’t scale up to meet demand, we wouldn’t have any customers.”
My advice:
1. Align incentives via SLAs in the contract.
2. Do test runs.
Align Incentives via Money Behind The Promises
Write into the contract that you get a refund if they fail their SLA (Service Level
Agreement).
An example SLA that I like to ask for:
● 10% refund below 99.9% uptime (43 minutes of outage in a month)
● 25% refund below 99.7% uptime (2.2 hours in a month)
● 50% refund below 99.5% uptime (3.6 hours in a month)
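Those tiers can be written as a simple lookup (a sketch of the example SLA above). Note that 99.9% uptime over a 30-day month allows 0.001 × 30 × 24 × 60 ≈ 43 minutes of downtime, which is where the parenthetical figures come from:

```python
def sla_refund_pct(monthly_uptime_pct):
    """Refund percentage for the example SLA tiers."""
    if monthly_uptime_pct < 99.5:
        return 50
    if monthly_uptime_pct < 99.7:
        return 25
    if monthly_uptime_pct < 99.9:
        return 10
    return 0

# Allowed downtime at each threshold, in minutes per 30-day month:
allowed = {p: round((100 - p) / 100 * 30 * 24 * 60, 1) for p in (99.9, 99.7, 99.5)}
# {99.9: 43.2, 99.7: 129.6, 99.5: 216.0}  -> ~43 min, ~2.2 h, ~3.6 h
```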
Negotiation of SLA
Common negotiation rebuttals from the vendor:
● “We use a standard contract, and we don’t put refunds in any of our contracts.”
○ In my experience, they always ended up putting the SLA into the contract.
○ Occasionally they want a carveout for outages that are not their fault (e.g. if
AWS went down). I added the carveout.
● “Instead of a refund, we’ll release you from the contract if we fail the SLA.”
○ It is a big investment for you to switch vendors. They should pay the cost for
their outage, not your engineering team.
● “You can talk to our CTO & our customer who will tell you that our uptime is great.”
○ That’s good, but there’s no substitute for the SLA in the contract.
Making the SLA Count
If their uptime falls below SLA, always ask for the refund.
Even if you lost $1M due to their outage, and the refund is only $5,000, ask for the refund.
When they give you the refund, that refund will show up on their P&L, which will unite their executives, PMs, and Board of Directors around providing good service for you.
If they don't have to give a refund, they will be torn between signing up new customers vs. fixing the issue for existing customers.
Test out the Outage Reporting Process
Do test-runs of reporting an outage.
This can often be helpful to the vendor too. They may realize they need better escalation or handoff procedures, or that they need to improve the training of their technical-support staff.
Conduct a test during your beta, so that you are familiar with the process and they can
iron out the wrinkles.
Re-Architecting
Deciding Trade-off
There will always be “more that you can do” to prepare for scalability.
One tough decision is whether to change architecture, e.g.:
● switch to a more performant database or CDN
● go multi-zone or multi-region
● switch hosting providers
● rewrite part of your stack in another framework or programming language
Changing architecture is usually a hard slog.
● Takes longer than expected
● Opportunity cost of using that time to create revenue-driving features
Questions to Decide Whether to Proceed
● Is this causing frequent real-world bottlenecks, or are you anticipating / predicting?
○ If it’s not yet causing bottlenecks, can you delay the re-architecture?
● If this is causing bottlenecks but they are infrequent (e.g. once per month), is there
a way to lessen the pressure?
○ E.g. Direct part of the traffic to another service?
● If you delay the re-architecture, does it become vastly harder later?
○ Examples of product areas that are hard to change with additional scale (and
thus you might want to do the re-architecture while it’s easier to change):
■ Login methods
■ Database technology
If There’s Internal Debate...
If there’s fierce internal debate about the re-architecture:
● If there’s disagreement, can you port one less-contentious feature so that you have
real-world data to discuss?
○ E.g. Niantic ported the user account system (lower QPS) before moving
databases for entire products.
● Make a detailed time-estimate listing every sub-task (with a one-week granularity).
○ This tells everyone the “price” in development time, so they can make the
cost-benefit tradeoff.
○ Sometimes a re-architecture happens only because the team guesstimated how long it would take, and estimated too low.
After You’ve Embarked
After you’ve made the decision to do the re-architecture:
● Look for ways to do the re-architecture one piece at a time and derive benefit, rather
than a wholesale rewrite that will be much harder to coordinate.
● Follow the detailed cost-estimate you made (referenced on the last slide), so that
you can tell each week whether you’re on track schedule-wise.
Would love to hear from you!
niniane@gmail.com
