Ensuring Your Technology Will Scale
Niniane Wang
Basis Set Ventures
October 2019
Speaker Background
VP of Engineering at Niantic (acquired my startup)
Board Member at Serena & Lily
Advisor to Basis Set Ventures
Founder / CEO of startup Evertoon
CTO of Minted
Led Gmail Ads eng, cofounded Google Desktop (75M active users)
Eng manager on Microsoft Flight Simulator
Graduated from Caltech in computer science at age 18
Technology Stacks
Niantic: single shared game world, with geographically indexed database tables.
● Java
● Datastore, Spanner
● Running on Google Cloud Platform
Minted: commerce platform with algorithmically predicted crowdsourced designs.
● Python
● Flask microframework
● MySQL
● Dedicated Rackspace (at the time)
Google: various services.
● Java
● Borg
● Internal Google Cloud Platform
Traffic Spikes
Each service I’ve worked on has experienced spikes.
Niantic:
● Launch of Pokémon GO, Harry Potter: Wizards Unite
Minted:
● Television appearances
Google:
● Launch of any Google product
Pokémon GO launch
Agenda
● Optimize your loadtesting
● How to handle the unexpected
● Tips for working with third-party dependencies
● The tough decision of when to re-architect
Optimizing Loadtesting
Loadtesting: Review of Methodology
● Set up sequences of API calls based on common user journeys
○ Simulate a user doing the core experience of your product
○ Simulate a situation likely to cause contention, similar to Google Docs with 100 simultaneous editors
● Use a tool such as Apache Bench to simulate simultaneous users
● We spun up hundreds of server instances to simulate users
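A minimal sketch of the methodology above, assuming a thread pool stands in for simultaneous users. The journey steps and `fake_api_call` are hypothetical placeholders, not part of the talk; swap in your own HTTP client calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical user journey: an ordered list of API steps for one user.
JOURNEY = ["login", "load_map", "catch_item", "view_inventory"]

def fake_api_call(step: str) -> None:
    """Stand-in for a real HTTP request; replace with your own client call."""
    time.sleep(0.001)  # pretend the server took about 1 ms

def run_user(user_id: int, call=fake_api_call) -> list[float]:
    """Run one simulated user through the journey; return per-step latencies (seconds)."""
    latencies = []
    for step in JOURNEY:
        start = time.perf_counter()
        call(step)
        latencies.append(time.perf_counter() - start)
    return latencies

def run_loadtest(num_users: int, max_workers: int = 50) -> list[float]:
    """Run num_users journeys concurrently and return all latencies, flattened."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_user, range(num_users))
        return [lat for user_lats in results for lat in user_lats]

if __name__ == "__main__":
    lats = sorted(run_loadtest(num_users=100))
    p95 = lats[int(0.95 * (len(lats) - 1))]
    print(f"{len(lats)} calls, p95 latency {p95 * 1000:.1f} ms")
```

A single machine running threads only goes so far; past that point you need many load-generating instances, as the last bullet describes.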
When Traditional Loadtesting Works Best
Pros:
● Can re-run any time at your convenience.
● This means you can make fixes and then run the loadtest again.
● This is well-suited to the early stages of scalability work, when there are many bottlenecks to uncover.
Cons:
● Inevitably will have some discrepancy from actual user API calling patterns
● Won’t simulate the variety of user locations, cache hits / misses, different user devices
Using Real Users to Expose Bottlenecks
After you’ve gotten past the initial bottlenecks, it’s time to more closely simulate real user
traffic.
In your beta, you can reduce your server resources, to expose bottlenecks.
● For example, let’s say you expect to serve 1M users with 200 servers.
● Open your beta to 20,000 users, and use 4 servers.
● Hold an event that encourages simultaneous use.
● Or take one server out of rotation, and look at which resource gets
bottlenecked first.
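The arithmetic behind the example: 1M users on 200 servers is 5,000 users per server, and 20,000 beta users on 4 servers is the same ratio. A small sketch, with a hypothetical helper name, for picking the beta server count:

```python
import math

def beta_servers(target_users: int, target_servers: int, beta_users: int) -> int:
    """Servers needed in beta to keep the same users-per-server ratio as at launch."""
    users_per_server = target_users / target_servers
    return max(1, math.ceil(beta_users / users_per_server))

print(beta_servers(1_000_000, 200, 20_000))  # 5,000 users per server -> 4
```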
Traffic Shadowing
You can send real user traffic to an unlaunched service, using asynchronous calls.
This mimics real traffic patterns:
● Which edge servers are hit
● CDN and other caches
If the unlaunched service goes down, make sure that real user traffic won’t get affected.
function doOperationA () {
  …
  start asynchronous thread to do operation B (ignore its result and any errors)
  …
  return value for operation A
}
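One way the pseudocode above could look in Python. This is a sketch under assumptions: `operation_a` and `operation_b` are placeholders for your launched and unlaunched services, and the pool size of 4 is arbitrary. The point is that the shadow call runs off the request path and its failures are swallowed.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

# Small dedicated pool so shadow calls cannot exhaust the main request threads.
_shadow_pool = ThreadPoolExecutor(max_workers=4)

def operation_a(request: dict) -> str:
    """The launched, user-facing operation (placeholder logic)."""
    return f"A handled {request['id']}"

def operation_b(request: dict) -> None:
    """The unlaunched service being shadowed (placeholder; here it always fails)."""
    raise RuntimeError("unlaunched service is down")

def _shadow(request: dict) -> None:
    try:
        operation_b(request)
    except Exception:
        # Swallow and log: a shadow failure must never reach the user.
        logging.exception("shadow call failed")

def do_operation_a(request: dict) -> str:
    _shadow_pool.submit(_shadow, dict(request))  # fire and forget, on a copy
    return operation_a(request)
```

In a real deployment you would probably also want a timeout on the shadow call and a bound on how many shadow requests can queue up.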
Handling the Unexpected
Give Real-time Levers to Your Future Self
Most bottlenecks you encounter will be unexpected. (If they were expected, you would have fixed them pre-launch.)
Q: How can you react quickly to unexpected problems?
A: Give tools to your future self:
1. Charts to visualize server metrics
2. Real-time levers to reduce resource contention
Example 1: Directing Traffic to Servers
In one situation, we split traffic across servers. There were two ways of distributing it:
● Round robin
● An algorithm to keep similar traffic on the same server, to reduce inter-server calls
We created a switch that could move between the two methods. If the algorithm made the load too uneven, we could switch to round robin.
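A sketch of such a switch. The talk does not say what the affinity algorithm was; hashing a traffic key is an assumption used here for illustration, and the server names and `Router` class are hypothetical.

```python
import hashlib
import itertools

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical pool

class Router:
    def __init__(self, servers):
        self.servers = list(servers)
        self.mode = "affinity"  # the real-time lever: "affinity" or "round_robin"
        self._rr = itertools.cycle(self.servers)

    def pick(self, traffic_key: str) -> str:
        if self.mode == "round_robin":
            return next(self._rr)
        # Affinity: a stable hash keeps the same key on the same server.
        digest = hashlib.sha256(traffic_key.encode()).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]

router = Router(SERVERS)
```

Flipping `router.mode` at runtime (from a config service or an admin endpoint) changes the distribution without a deploy.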
Example 2: Reducing Database Contention
For database contention, a certain high-traffic startup used a trick to count views once a page was getting a lot of them:
● 25% of the time, they added 3 views
● 25% of the time, they added 1 view
● 50% of the time, they didn’t add any views
Any single count isn’t precise, but the expected increment is still 1 view per real view (0.25 × 3 + 0.25 × 1 = 1), and it gave them the option of cutting database calls in half during high-traffic times.
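The scheme can be sketched as follows. The function name is hypothetical; the probabilities and increments are the ones listed above. The two computed values show that the count stays unbiased on average while only half of the views trigger a database write.

```python
import random

# (probability, increment) pairs from the list above.
SCHEME = [(0.25, 3), (0.25, 1), (0.50, 0)]

def sampled_increment(rng: random.Random) -> int:
    """Return how many views to add for one real page view."""
    r = rng.random()
    if r < 0.25:
        return 3
    if r < 0.50:
        return 1
    return 0  # no database write needed for this view

expected_views = sum(p * inc for p, inc in SCHEME)       # average views added per real view
write_fraction = sum(p for p, inc in SCHEME if inc > 0)  # share of views that write
```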
Example 3: Config Value for API Call
In one situation, each user had a token that needed to be refreshed every X minutes.
We made X a real-time configurable value.
When there was a bottleneck on refreshing the token, X could be increased in real time.
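A minimal sketch of X as a runtime value. `RuntimeConfig`, the key name, and the default of 15 minutes are illustrative assumptions; in production the value would likely come from a config service. It also assumes the token's real expiry is long enough to tolerate a larger X.

```python
import threading

class RuntimeConfig:
    """Thread-safe holder for values that operators can change while serving."""
    def __init__(self, **values):
        self._lock = threading.Lock()
        self._values = dict(values)

    def get(self, key):
        with self._lock:
            return self._values[key]

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

config = RuntimeConfig(token_refresh_minutes=15)  # 15 is an illustrative default

def needs_refresh(issued_at_s: float, now_s: float) -> bool:
    """True once the token is older than the currently configured X minutes."""
    return now_s - issued_at_s >= config.get("token_refresh_minutes") * 60
```

Calling `config.set("token_refresh_minutes", 60)` during an incident cuts refresh traffic immediately, because every check reads the current value.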
Types of Levers to Consider
● Feature flags to temporarily disable non-critical features
● Config values that control the frequency of API calls, token
refreshes, periodic jobs
● Ability to add additional servers to a pool
● Reduce API calls from clients (e.g. mobile apps) in the wild
○ E.g. Retry loops
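For the retry-loop lever, one common approach (an assumption here, not something the talk prescribes) is capped exponential backoff with jitter, so that clients in the wild do not retry in lockstep during an outage. The function and parameter names are hypothetical.

```python
import random
import time
from typing import Callable, Optional

def call_with_retries(call: Callable[[], object], max_retries: int = 5,
                      base_s: float = 0.5, cap_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None):
    """Retry `call` with capped exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Wait a random time in [0, min(cap, base * 2**attempt)] before retrying.
            sleep(rng.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```

If `max_retries`, `base_s`, and `cap_s` are delivered to clients from the server, they become levers you can adjust without shipping a new app version.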
Resources that Could Become Bottlenecks
Here are the most common resource contentions. What levers can you add?
● Database contention (e.g. volume of queries, hotspots)
● Worker pools and worker queues
● Disk space
● Memory
● CPU
● Third-party partner/vendor dependencies
Practice in Advance
Practice these and document the steps & pitfalls:
● Restore data from backup
● Reboot one specific server
● Add a server into the rotation
● Failover to another zone
● … other custom emergencies based on your architecture ...
Third-Party Dependencies
Third Party Dependencies
You will always have some reliance on third-party vendors:
● Login
● Analytics
● Social services
● Marketing services
● Customer service, e.g. live chat
● ...
Vendor Reassurances
Vendors will reassure you:
“We have bigger customers than you. We can handle 50 times your traffic level.”
“Black Friday is make-or-break for us, and we’ve gone all-out to prepare.”
“If we couldn’t scale up to meet demand, we wouldn’t have any customers.”
My advice:
1. Align incentives via SLAs in the contract.
2. Do test runs.
Align Incentives via Money Behind The Promises
Write into the contract that you get a refund if they fail their SLA (Service Level Agreement).
An example SLA that I like to ask for:
● 10% refund below 99.9% uptime (43 minutes of outage in a month)
● 25% refund below 99.7% uptime (2.2 hours in a month)
● 50% refund below 99.5% uptime (3.6 hours in a month)
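The outage figures follow from a 30-day month (43,200 minutes). A short sketch that reproduces them and maps a measured uptime to a refund tier; the function names are hypothetical, and the tiers are the example ones above.

```python
def allowed_downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Downtime budget, in minutes, for a given uptime percentage over `days` days."""
    return (100.0 - uptime_pct) / 100.0 * days * 24 * 60

# (uptime threshold %, refund %) tiers from the example above, largest refund last.
TIERS = [(99.9, 10), (99.7, 25), (99.5, 50)]

def refund_pct(measured_uptime_pct: float) -> int:
    """Largest refund whose uptime threshold the measured uptime fell below."""
    refund = 0
    for threshold, pct in TIERS:
        if measured_uptime_pct < threshold:
            refund = pct
    return refund
```

For instance, 99.9% leaves about 43.2 minutes, 99.7% about 129.6 minutes (2.16 hours), and 99.5% about 216 minutes (3.6 hours).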
Negotiation of SLA
Common negotiation rebuttals from the vendor:
● “We use a standard contract, and we don’t put refunds in any of our contracts.”
○ In my experience, they always ended up putting the SLA into the contract.
○ Occasionally they want a carveout for outages that are not their fault (e.g. if
AWS went down). I added the carveout.
● “Instead of a refund, we’ll release you from the contract if we fail the SLA.”
○ It is a big investment for you to switch vendors. They should pay the cost for
their outage, not your engineering team.
● “You can talk to our CTO & our customer who will tell you that our uptime is great.”
○ That’s good, but there’s no substitute for the SLA in the contract.
Making the SLA Count
If their uptime falls below SLA, always ask for the refund.
Even if you lost $1M due to their outage, and the refund is only $5,000, ask for the refund.
When they give you the refund, that refund will show up on their P&L, which will align their executives, PMs, and Board of Directors around providing good service for you.
If they don’t have to give a refund, they will be torn between signing up new customers vs
fixing this issue for existing customers.
Test out the Outage Reporting Process
Do test-runs of reporting an outage.
This can often be helpful to the vendor too. They may realize they need to have better
escalation or handoff procedures, or that they need to improve the training to their
technical-support staff.
Conduct a test during your beta, so that you are familiar with the process and they can
iron out the wrinkles.
Re-Architecting
Deciding Trade-off
There will always be “more that you can do” to prepare for scalability.
One tough decision is in changing architecture, e.g.
● switch to a more performant database or CDN
● go multi-zone or multi-region
● switch hosting providers
● rewrite part of your stack in another framework or programming language
Changing architecture is usually a hard slog.
● Takes longer than expected
● Opportunity cost of using that time to create revenue-driving features
Questions to Decide Whether to Proceed
● Is this causing frequent real-world bottlenecks, or are you anticipating / predicting?
○ If it’s not yet causing bottlenecks, can you delay the re-architecture?
● If this is causing bottlenecks but they are infrequent (e.g. once per month), is there
a way to lessen the pressure?
○ E.g. Direct part of the traffic to another service?
● If you delay the re-architecture, does it become vastly harder later?
○ Examples of product areas that are hard to change with additional scale (and
thus you might want to do the re-architecture while it’s easier to change):
■ Login methods
■ Database technology
If There’s Internal Debate...
If there’s fierce internal debate about the re-architecture:
● Can you port one less-contentious feature so that you have real-world data to discuss?
○ E.g. Niantic ported the user account system (lower QPS) before moving
databases for entire products.
● Make a detailed time-estimate listing every sub-task (with a one-week granularity).
○ This tells everyone the “price” in development time, so they can make the
cost-benefit tradeoff.
○ Sometimes a re-architecture goes ahead because the team guesstimated how long it would take, and estimated too low.
After You’ve Embarked
After you’ve made the decision to do the re-architecture:
● Look for ways to do the re-architecture one piece at a time and derive benefit, rather
than a wholesale rewrite that will be much harder to coordinate.
● Follow the detailed cost-estimate you made (referenced on the last slide), so that
you can tell each week whether you’re on track schedule-wise.
Would love to hear from you!
niniane@gmail.com

Ensuring Your Technology Will Scale

  • 1. Ensuring Your Technology Will Scale Niniane Wang Basis Set Ventures October 2019
  • 2. Speaker Background VP of Engineering at Niantic (acquired my startup) Board Member at Serena & Lily Advisor to Basis Set Ventures Founder / CEO of startup Evertoon CTO of Minted Led Gmail Ads eng, cofounded Google Desktop (75M active users) Eng manager on Microsoft Flight Simulator Graduated from Caltech in computer science at age 18
  • 3. Technology Stacks Single shared game world, with geographically indexed database tables. ● Java ● Datastore, Spanner ● Running on Google Cloud Platform Commerce platform with algorithmically predicted crowdsourced designs: ● Python ● Flask microframework ● MySQL ● Dedicated Rackspace (at the time) Various services: ● Java ● Borg ● Internal Google Cloud Platform
  • 4. Traffic Spikes Each service I’ve worked on has experienced spikes. Niantic: ● Launch of Pokemon GO, Harry Potter: Wizards Unite Minted: ● Television appearances Google: ● Launch of any Google product Pokémon GO launch
  • 5. Agenda ● Optimize your loadtesting ● How to handle the unexpected ● Tips for working with third-party dependencies ● The tough decision of when to re-architect
  • 7. Loadtesting: Review of Metholodogy Set up sequences of API calls based on common user journeys ○ Simulate a user doing the core experience of your product ○ Simulate a situation likely to cause contention, similar to Google Docs with 100 simultaneous editors ● Use a tool such as Apache Bench to simulate simultaneous users ● We spun up hundreds of server instances to simulate users
  • 8. When Traditional Loadtesting Works Best Pros: ● Can re-run any time at your convenience. ● This means you can make fixes and then run the loadtest again. ● This is well-suited to the early stages of doing scalability work, when there’s many bottlenecks to uncover. Cons ● Inevitably will have some discrepancy from actual user API calling patterns ● Won’t simulate the variety of user locations, cache hits / misses, different user devices
  • 9. Using Real Users to Expose Bottlenecks After you’ve gotten past the initial bottlenecks, it’s time to more closely simulate real user traffic. In your beta, you can reduce your server resources, to expose bottlenecks. ● For example, let’s say you expect to serve 1M users with 200 servers. ● Open your beta to 20,000 users, and use 4 servers. ● Hold an event that encourages simultaneous use. ● Or take one server out of rotation, and look at which resource gets bottlenecked first.
  • 10. Traffic Shadowing You can send real user traffic to an unlaunched service, using asynchronous calls. This mimics real traffic patterns: ● Which edge servers are hit ● CDN and other caches If the unlaunched service goes down, make sure that real user traffic won’t get affected. function doOperationA () { … start asynchronous thread to do operation B … return value for operation A
  • 12. Give Real-time Levers to Your Future Self Most bottlenecks you encounter will be unexpected. (If they were expected, you can fix them pre-launch.) Q: How can you react quickly to unexpected problems? A: Give tools to your future self: 1. Charts to visualize server metrics 2. Real-time levers to reduce resource contention
  • 13. Example 1: Directing Traffic to Servers In one situation, we split traffic across servers: Ways of distributing traffic across servers: ● Round robin ● Algorithm to keep similar traffic on the same server, to reduce inter-server calls We created a switch that can move between methods. If the algorithm was too uneven, we could switch to round robin.
  • 14. Example 2: Reducing Database Contention For database contention, a certain high-traffic startup used a trick to count views, after a page was getting a lot of views: ● 25% of the time, they added 3 views ● 25% of the time, they added 1 view ● 50% of the time, they didn’t add any views This isn’t precise, but it gave them the option of cutting database calls in half during high-traffic times.
  • 15. Example 3: Config Value for API Call In one situation, each user had a token that needed to be refreshed every X minutes. We made X a real-time configurable value. When there was a bottleneck on refreshing the token, X can be increased in real-time.
  • 16. Types of Levers to Consider ● Feature flags to temporarily disable non-critical features ● Config values that control the frequency of API calls, token refreshes, periodic jobs ● Ability to add additional servers to a pool ● Reduce API calls from clients (e.g. mobile apps) in the wild ○ E.g. Retry loops
  • 17. Resources that Could Become Bottlenecks Here are the most common resource contentions. What levers can you add? ● Database contention (e.g. volume of queries, hotspots) ● Worker pools and worker queues ● Disk space ● Memory ● CPU ● Third-party partner/vendor dependencies
  • 18. Practice in Advance Practice these and document the steps & pitfalls: ● Restore data from backup ● Reboot one specific server ● Add a server into the rotation ● Failover to another zone ● … other custom emergencies based on your architecture ...
  • 20. Third Party Dependencies You will always have some reliance on third-party vendors: ● Login ● Analytics ● Social services ● Marketing services ● Customer service, e.g. live chat ● ...
  • 21. Vendor Reassurances Vendors will reassure you: “We have bigger customers than you. We can handle 50 times your traffic level.” “Black Friday is make-or-break for us, and we’ve gone all-out to prepare.” “If we couldn’t scale up to meet demand, we wouldn’t have any customers.” My advice: 1. Align incentives via SLAs in the contract. 2. Do test runs.
  • 22. Align Incentives via Money Behind The Promises Write into the contract that you get a refund if they fail their SLA (Service Level Agreement). An example SLA that I like to ask for: ● 10% refund below 99.9% uptime (43 minutes outage in a month) ● 25% refund below 99.7% uptime (2.2 hours in a month) ● 50% refund below 99.5% uptime (3.6 hours in a month)
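The downtime figures above follow from a 30-day month (43,200 minutes). A small sketch of the arithmetic and the tier lookup, with hypothetical function names:

```python
def allowed_downtime_minutes(uptime_percent, days_in_month=30):
    """Minutes of outage per month permitted at a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * days_in_month * 24 * 60


# (uptime threshold %, refund %), checked from the lowest threshold up.
REFUND_TIERS = [(99.5, 50), (99.7, 25), (99.9, 10)]


def refund_percent(measured_uptime_percent):
    """Return the refund owed under the example SLA tiers."""
    for threshold, refund in REFUND_TIERS:
        if measured_uptime_percent < threshold:
            return refund
    return 0
```

For example, 99.9% uptime allows about 43.2 minutes of outage in a 30-day month, 99.7% allows about 2.16 hours, and 99.5% allows 3.6 hours.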
  • 23. Negotiation of SLA Common negotiation rebuttals from the vendor: ● “We use a standard contract, and we don’t put refunds in any of our contracts.” ○ In my experience, they always ended up putting the SLA into the contract. ○ Occasionally they want a carveout for outages that are not their fault (e.g. if AWS went down). I added the carveout. ● “Instead of a refund, we’ll release you from the contract if we fail the SLA.” ○ It is a big investment for you to switch vendors. They should pay the cost for their outage, not your engineering team. ● “You can talk to our CTO & our customer who will tell you that our uptime is great.” ○ That’s good, but there’s no substitute for the SLA in the contract.
  • 24. Making the SLA Count If their uptime falls below SLA, always ask for the refund. Even if you lost $1M due to their outage, and the refund is only $5,000, ask for the refund. When they give you the refund, that refund will show up on their P&L, which will align their executives, PMs, and Board of Directors behind providing good service for you. If they don’t have to give a refund, they will be torn between signing up new customers and fixing this issue for existing customers.
  • 25. Test out the Outage Reporting Process Do test runs of reporting an outage. This can often be helpful to the vendor too. They may realize they need better escalation or handoff procedures, or that they need to improve the training of their technical-support staff. Conduct a test during your beta, so that you are familiar with the process and they can iron out the wrinkles.
  • 27. Deciding Trade-off There will always be “more that you can do” to prepare for scalability. One tough decision is whether to change architecture, e.g. ● switch to a more performant database or CDN ● go multi-zone or multi-region ● switch hosting providers ● rewrite part of your stack in another framework or programming language Changing architecture is usually a hard slog. ● Takes longer than expected ● Opportunity cost: that time could have gone to creating revenue-driving features
  • 28. Questions to Decide Whether to Proceed ● Is this causing frequent real-world bottlenecks, or are you only anticipating them? ○ If it’s not yet causing bottlenecks, can you delay the re-architecture? ● If this is causing bottlenecks but they are infrequent (e.g. once per month), is there a way to lessen the pressure? ○ E.g. Direct part of the traffic to another service? ● If you delay the re-architecture, does it become vastly harder later? ○ Examples of product areas that get harder to change as you scale (and thus you might want to do the re-architecture while it’s easier to change): ■ Login methods ■ Database technology
  • 29. If There’s Internal Debate... If there’s fierce internal debate about the re-architecture: ● Can you port one less-contentious feature so that you have real-world data to discuss? ○ E.g. Niantic ported the user account system (lower QPS) before moving databases for entire products. ● Make a detailed time estimate listing every sub-task (with a one-week granularity). ○ This tells everyone the “price” in development time, so they can make the cost-benefit tradeoff. ○ Sometimes a re-architecture goes ahead because the team guess-timated how long it would take, and the estimate was too low.
  • 30. After You’ve Embarked After you’ve made the decision to do the re-architecture: ● Look for ways to do the re-architecture one piece at a time and derive benefit from each piece, rather than a wholesale rewrite that will be much harder to coordinate. ● Follow the detailed time estimate you made (see the previous slide), so that you can tell each week whether you’re on schedule.
  • 31. Would love to hear from you! niniane@gmail.com