Ensuring Your Technology Will
Scale
Niniane Wang
Basis Set Ventures
October 2019
Speaker Background
VP of Engineering at Niantic (acquired my startup)
Board Member at Serena & Lily
Advisor to Basis Set Ventures
Founder / CEO of startup Evertoon
CTO of Minted
Led Gmail Ads eng, cofounded Google Desktop (75M active users)
Eng manager on Microsoft Flight Simulator
Graduated from Caltech in computer science at age 18
Technology Stacks
Single shared game world, with geographically indexed database tables:
● Java
● Datastore, Spanner
● Running on Google Cloud Platform

Commerce platform with algorithmically predicted crowdsourced designs:
● Python
● Flask microframework
● MySQL
● Dedicated Rackspace (at the time)

Various services:
● Java
● Borg
● Internal Google Cloud Platform
Traffic Spikes
Each service I’ve worked on has
experienced spikes.
Niantic:
● Launch of Pokémon GO and Harry Potter: Wizards Unite
Minted:
● Television appearances
Google:
● Launch of any Google product
Pokémon GO launch
Agenda
● Optimize your loadtesting
● How to handle the unexpected
● Tips for working with third-party dependencies
● The tough decision of when to re-architect
Optimizing Loadtesting
Loadtesting: Review of Methodology
● Set up sequences of API calls based on common user journeys
○ Simulate a user doing the core experience of your product
○ Simulate a situation likely to cause contention, similar to Google Docs with 100 simultaneous editors
● Use a tool such as Apache Bench to simulate simultaneous users
● We spun up hundreds of server instances to simulate users
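The journey-based approach can be sketched in a few lines of Python. The endpoint names and `call_api` below are hypothetical stand-ins; a real harness would use an HTTP client or a tool like Apache Bench:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_api(endpoint):
    """Hypothetical stand-in for a real HTTP call (e.g. via an HTTP client)."""
    time.sleep(0.001)  # simulate a little network latency
    return 200

def user_journey():
    """One scripted sequence of API calls: the core experience of the product."""
    statuses = [call_api(e) for e in ("/login", "/feed", "/action")]
    return all(s == 200 for s in statuses)

def run_load_test(simulated_users=50):
    """Run many journeys concurrently and return the fraction that succeeded."""
    with ThreadPoolExecutor(max_workers=simulated_users) as pool:
        results = list(pool.map(lambda _: user_journey(), range(simulated_users)))
    return sum(results) / len(results)
```

In practice you would scale `simulated_users` up across many machines, as the slide describes, and track latency percentiles in addition to the success rate.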
When Traditional Loadtesting Works Best
Pros:
● Can re-run any time at your convenience.
● This means you can make fixes and then run the loadtest again.
● Well-suited to the early stages of scalability work, when there are many bottlenecks to uncover.
Cons:
● Will inevitably have some discrepancy from actual user API calling patterns.
● Won't simulate the variety of user locations, cache hits / misses, or different user devices.
Using Real Users to Expose Bottlenecks
After you’ve gotten past the initial bottlenecks, it’s time to more closely simulate real user
traffic.
In your beta, you can reduce your server resources to expose bottlenecks.
● For example, let’s say you expect to serve 1M users with 200 servers.
● Open your beta to 20,000 users, and use 4 servers.
● Hold an event that encourages simultaneous use.
● Or take one server out of rotation, and look at which resource gets
bottlenecked first.
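The example numbers above keep the load per server constant. A quick sanity check (assuming load scales roughly linearly with users):

```python
def users_per_server(users, servers):
    return users / servers

# Expected production load vs. the deliberately under-provisioned beta:
production = users_per_server(1_000_000, 200)  # 5000.0 users per server
beta = users_per_server(20_000, 4)             # 5000.0 users per server
assert production == beta  # the beta stresses each server like launch day would
```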
Traffic Shadowing
You can send real user traffic to an unlaunched service, using asynchronous calls.
This mimics real traffic patterns:
● Which edge servers are hit
● CDN and other caches
If the unlaunched service goes down, make sure that real user traffic won’t get affected.
function doOperationA () {
…
start asynchronous thread to do operation B
…
return value for operation A
}
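A sketch of the same pattern in Python (the operation names are hypothetical): the shadow call runs on a daemon thread and swallows its own exceptions, so an outage in the unlaunched service cannot affect the real response:

```python
import threading

def operation_b(request):
    """Hypothetical call into the unlaunched service."""
    pass

def shadow_operation_b(request):
    try:
        operation_b(request)
    except Exception:
        pass  # shadow failures must never propagate to real traffic

def do_operation_a(request):
    # Fire-and-forget: the real request never waits on the shadow call
    threading.Thread(target=shadow_operation_b, args=(request,), daemon=True).start()
    return {"status": 200}  # real response computed as usual
```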
Handling the Unexpected
Give Real-time Levers to Your Future Self
Most bottlenecks you encounter will be unexpected. (If they were expected, you would have fixed them pre-launch.)
Q: How can you react quickly to unexpected problems?
A: Give tools to your future self:
1. Charts to visualize server metrics
2. Real-time levers to reduce resource contention
Example 1: Directing Traffic to Servers
In one situation, we needed to distribute traffic across servers. Ways of doing so:
● Round robin
● Algorithm to keep similar traffic on the same server, to reduce inter-server calls
We created a switch that can move between methods. If the algorithm was too uneven,
we could switch to round robin.
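A minimal sketch of such a switch in Python (the server names and config mechanism are hypothetical); in production the mode would come from a real-time config store rather than a module variable:

```python
import itertools

SERVERS = ["server-1", "server-2", "server-3"]
ROUTING_MODE = "affinity"  # the real-time lever: flip to "round_robin" under load
_round_robin = itertools.cycle(SERVERS)

def route(traffic_key):
    """Pick a server: keep similar traffic together, or spread it evenly."""
    if ROUTING_MODE == "affinity":
        # Same key -> same server, which reduces inter-server calls
        return SERVERS[hash(traffic_key) % len(SERVERS)]
    return next(_round_robin)
```

In affinity mode, `route("cell-42")` returns the same server every time within a process; flipping `ROUTING_MODE` to `"round_robin"` evens out a hotspot at the cost of more inter-server traffic.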
Example 2: Reducing Database Contention
For database contention, a certain high-traffic startup used a trick to count views once a page started getting heavy traffic:
● 25% of the time, they added 3 views
● 25% of the time, they added 1 view
● 50% of the time, they didn’t add any views
This isn’t precise, but it gave them the option of cutting database calls in half during
high-traffic times.
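A sketch of that probabilistic counter in Python. The expected increment per page view is 0.25 × 3 + 0.25 × 1 = 1 view, so counts stay accurate on average while only ~50% of views touch the database:

```python
import random

def record_view(rng):
    """Return (db_write_needed, views_to_add) for one page view."""
    r = rng.random()
    if r < 0.25:
        return True, 3   # 25% of the time: write +3 views
    if r < 0.50:
        return True, 1   # 25% of the time: write +1 view
    return False, 0      # 50% of the time: skip the database entirely

# Simulate 100,000 page views with a fixed seed:
rng = random.Random(0)
outcomes = [record_view(rng) for _ in range(100_000)]
write_rate = sum(w for w, _ in outcomes) / len(outcomes)      # ~0.5
avg_increment = sum(v for _, v in outcomes) / len(outcomes)   # ~1.0
```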
Example 3: Config Value for API Call
In one situation, each user had a token that needed to be refreshed every X minutes.
We made X a real-time configurable value.
When there was a bottleneck on refreshing tokens, X could be increased in real time.
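A sketch in Python (the config names are hypothetical): because the interval is read on every check rather than baked in, operations can raise it live during a bottleneck:

```python
CONFIG = {"token_refresh_minutes": 15}  # backed by a real-time config store

def refresh_due(minutes_since_last_refresh):
    """Re-read the interval on every check so a live config change takes effect."""
    return minutes_since_last_refresh >= CONFIG["token_refresh_minutes"]

assert refresh_due(20)                # due at the default 15-minute interval
CONFIG["token_refresh_minutes"] = 60  # lever pulled during a bottleneck
assert not refresh_due(20)            # refresh pressure drops immediately
```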
Types of Levers to Consider
● Feature flags to temporarily disable non-critical features
● Config values that control the frequency of API calls, token
refreshes, periodic jobs
● Ability to add more servers to a pool
● Reduce API calls from clients (e.g. mobile apps) in the wild
○ E.g. Retry loops
Resources that Could Become Bottlenecks
Here are the most common resource contentions. What levers can you add?
● Database contention (e.g. volume of queries, hotspots)
● Worker pools and worker queues
● Disk space
● Memory
● CPU
● Third-party partner/vendor dependencies
Practice in Advance
Practice these and document the steps & pitfalls:
● Restore data from backup
● Reboot one specific server
● Add a server into the rotation
● Failover to another zone
● … other custom emergencies based on your architecture ...
Third-Party Dependencies
Third-Party Dependencies
You will always have some reliance on third-party vendors:
● Login
● Analytics
● Social services
● Marketing services
● Customer service, e.g. live chat
● ...
Vendor Reassurances
Vendors will reassure you:
“We have bigger customers than you. We can handle 50 times your traffic level.”
“Black Friday is make-or-break for us, and we’ve gone all-out to prepare.”
“If we couldn’t scale up to meet demand, we wouldn’t have any customers.”
My advice:
1. Align incentives via SLAs in the contract.
2. Do test runs.
Align Incentives via Money Behind The Promises
Write into the contract that you get a refund if they fail their SLA (Service Level
Agreement).
An example SLA that I like to ask for:
● 10% refund below 99.9% uptime (43 minutes of outage in a month)
● 25% refund below 99.7% uptime (2.2 hours in a month)
● 50% refund below 99.5% uptime (3.6 hours in a month)
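Those tiers can be written as a simple lookup (a sketch of the example SLA above). Note that 99.9% uptime over a 30-day month allows 0.001 × 30 × 24 × 60 ≈ 43 minutes of downtime, which is where the parenthetical figures come from:

```python
def sla_refund_pct(monthly_uptime_pct):
    """Refund percentage for the example SLA tiers."""
    if monthly_uptime_pct < 99.5:
        return 50
    if monthly_uptime_pct < 99.7:
        return 25
    if monthly_uptime_pct < 99.9:
        return 10
    return 0

# Allowed downtime at each threshold, in minutes per 30-day month:
allowed = {p: round((100 - p) / 100 * 30 * 24 * 60, 1) for p in (99.9, 99.7, 99.5)}
# {99.9: 43.2, 99.7: 129.6, 99.5: 216.0}  -> ~43 min, ~2.2 h, ~3.6 h
```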
Negotiation of SLA
Common negotiation rebuttals from the vendor:
● “We use a standard contract, and we don’t put refunds in any of our contracts.”
○ In my experience, they always ended up putting the SLA into the contract.
○ Occasionally they want a carveout for outages that are not their fault (e.g. if
AWS went down). I added the carveout.
● “Instead of a refund, we’ll release you from the contract if we fail the SLA.”
○ It is a big investment for you to switch vendors. They should pay the cost for
their outage, not your engineering team.
● “You can talk to our CTO & our customer who will tell you that our uptime is great.”
○ That’s good, but there’s no substitute for the SLA in the contract.
Making the SLA Count
If their uptime falls below SLA, always ask for the refund.
Even if you lost $1M due to their outage, and the refund is only $5,000, ask for the refund.
When they give you the refund, that refund will show up on their P&L, which will unite their executives, PMs, and Board of Directors around providing good service for you.
If they don't have to give a refund, they will be torn between signing up new customers vs. fixing the issue for existing customers.
Test out the Outage Reporting Process
Do test-runs of reporting an outage.
This can often be helpful to the vendor too. They may realize they need better escalation or handoff procedures, or that they need to improve the training of their technical-support staff.
Conduct a test during your beta, so that you are familiar with the process and they can
iron out the wrinkles.
Re-Architecting
Deciding Trade-off
There will always be “more that you can do” to prepare for scalability.
One tough decision is whether to change architecture, e.g.:
● switch to a more performant database or CDN
● go multi-zone or multi-region
● switch hosting providers
● rewrite part of your stack in another framework or programming language
Changing architecture is usually a hard slog.
● Takes longer than expected
● Opportunity cost of using that time to create revenue-driving features
Questions to Decide Whether to Proceed
● Is this causing frequent real-world bottlenecks, or are you anticipating / predicting?
○ If it’s not yet causing bottlenecks, can you delay the re-architecture?
● If this is causing bottlenecks but they are infrequent (e.g. once per month), is there
a way to lessen the pressure?
○ E.g. Direct part of the traffic to another service?
● If you delay the re-architecture, does it become vastly harder later?
○ Examples of product areas that are hard to change with additional scale (and
thus you might want to do the re-architecture while it’s easier to change):
■ Login methods
■ Database technology
If There’s Internal Debate...
If there’s fierce internal debate about the re-architecture:
● If there’s disagreement, can you port one less-contentious feature so that you have
real-world data to discuss?
○ E.g. Niantic ported the user account system (lower QPS) before moving
databases for entire products.
● Make a detailed time-estimate listing every sub-task (with a one-week granularity).
○ This tells everyone the “price” in development time, so they can make the
cost-benefit tradeoff.
○ Sometimes a re-architecture happens only because the team guesstimated how long it would take, and estimated too low.
After You’ve Embarked
After you’ve made the decision to do the re-architecture:
● Look for ways to do the re-architecture one piece at a time and derive benefit, rather
than a wholesale rewrite that will be much harder to coordinate.
● Follow the detailed cost-estimate you made (referenced on the last slide), so that
you can tell each week whether you’re on track schedule-wise.
Would love to hear from you!
niniane@gmail.com
