Many sites and services crash these days under extremely heavy, unplanned load caused by high traffic, which makes this a great opportunity to learn. This is a live virtual lecture about system scalability. I will share the story of a small company that grows, faces challenges at each stage, and solves them by applying various scalability patterns and decomposing its monolithic system into distributed microservices.
2. Intuit Confidential and Proprietary 2
Agenda
Intro
Use Case
Fundamentals
Scaling Databases
Scaling API
Async Operations
Security in Cloud
Incident Management
Q&A
Where we started
Intuit’s journey began over 35 years ago when our founder Scott Cook sat at his
kitchen table and watched his wife as she balanced their checkbook and thought
there must be a better way.
Who we are
Founded: 1983
IPO: 1993
Employees: 9,000
Customers: 50M
FY19 Revenue: ~$6.8B
Locations: 20
Intuit global locations: 20 locations in 9 countries
Brazil
São Paulo
Europe
London, UK
Paris, France
Australia
Melbourne
Sydney
India
Bangalore
Israel
Tel Aviv
United States
California:
Los Angeles
Mountain View
San Diego
San Francisco
Boise, ID
Fredericksburg, VA
Plano/Dallas, TX
Reno, NV
Tucson, AZ
Washington, D.C.
Canada
Edmonton
Mississauga/Toronto
Mexico
Mexico City
Updated November 2019
Recognized as one of the world’s leading companies
2004 - 2019: Most Admired: Computer Software
2002 - 2019: 100 Best Companies to Work For
2019: Most Innovative Companies
2019: Companies Best Positioned For Breakout Growth
Recognized as one of the top companies to work for
#15 IN THE BAY AREA
#10 IN TECHNOLOGY
#1 in Canada
#24 in the US
#4 in the UK
#2 in India
#3 in Australia
#15 COMPANIES THAT CARE
#14 in Israel
Our values
Be Bold
Be Passionate
Learn Fast
Be Decisive
Win Together
Deliver Awesome
Integrity Without Compromise
We Care and Give Back
Core capabilities: Our recipe to execute with excellence
WHAT TO SOLVE: CUSTOMER-DRIVEN INNOVATION (CDI)
- An important, unsolved customer problem
- …that we, and those we enable, can solve well
- …and build durable competitive advantage
- SUCCESS IS HERE
HOW TO SOLVE: DESIGN FOR DELIGHT (D4D)
- Deep customer empathy
- Go broad to go narrow
- Rapid experiments with customers
- DELIGHT
Hermes Deliveries
- Assume we have a small startup that offers low-cost, fast deliveries by connecting people and delivery needs.
- Initially, our target is Rishon Le-Tzion, and we have tens of daily customers and 2-3 couriers.
Hermes Deliveries
- We store all customers, couriers, orders, places, and history in the same database, on a single DB instance.
- Everything is new and small, so we have no caching, no automation, no auto scaling, and no monitoring.
- This is fine, since we barely get 1 delivery order every 15 minutes.
COVID-19 time
- But with COVID-19, more and more people started using our service, since we are the cheapest and fastest.
- Now we have 20 orders every 30 minutes, and the number keeps increasing…
- We see how successful the service is and decide to offer it in Tel Aviv as well.
Turtle Deliveries
- At this point we realize that our system performs poorly.
- The app is very slow and API latency has increased.
- We have database transaction deadlocks and whole-system failures.
The Scale Cube
Data partitioning
Scale by splitting similar things [sharding]
Example: Cell Architecture and/or Sharding. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, for example a range of users.
Horizontal duplication
Scale by cloning
Example: 18 web servers behind a load balancer
Functional decomposition
Scale by splitting different things
Example: Orders, Inventory, Customers
Query Optimization, Indexing and Connection Pool
- We use an RDBMS that is heavily normalized.
- Therefore, it was decided to:
1. Introduce some redundant columns that frequently appear in WHERE and JOIN ON clauses (denormalization).
- This reduces join queries and breaks a few big queries into smaller ones.
2. Add indexes on columns that frequently appear in WHERE clauses.
3. Use a connection pool to limit the number of costly network connections.
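To make step 2 concrete, here is a minimal sketch of how an index changes a query plan. It uses Python's built-in sqlite3 module as a stand-in for the RDS database, and the `orders` table, its columns, and the index name are hypothetical. The exact plan text varies between SQLite versions, so the sketch only prints it.

```python
import sqlite3

# Hypothetical schema: an "orders" table that is often filtered by courier_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, courier_id INTEGER, city TEXT)"
)
conn.executemany(
    "INSERT INTO orders (courier_id, city) VALUES (?, ?)",
    [(i % 50, "Tel Aviv") for i in range(1000)],
)


def query_plan(connection, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


sql = "SELECT id FROM orders WHERE courier_id = 7"

plan_before = query_plan(conn, sql)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_orders_courier ON orders (courier_id)")
plan_after = query_plan(conn, sql)   # with the index: an index search

print(plan_before)
print(plan_after)
```

Checking the plan before and after is a cheap way to confirm that a new index is actually used by the queries it was meant for.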
Vertical Scaling or Scaling Up
- All of that improved application API latency by 30%, which was good enough at the time.
- We entered new areas.
- It was also decided to upgrade the RDS instance and add more storage to reduce future risks.
New Challenges
- Everything is running great and we have more orders and couriers, but we are facing new issues… again…
1. The database index grows and requires maintenance.
2. Table scanning with the index is slow.
- Upgrading to a bigger RDS instance is costly, and we are not yet profitable.
- What is your next step?
Read Replicas
- Even a bigger RDS instance is not able to handle all READ and WRITE requests.
- In most cases we need consistency for WRITEs, but small delays on READs are fine.
- Therefore, it was decided to create two READ replicas of the source RDS instance, thereby increasing read throughput.
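A minimal sketch of how the application side might route traffic once replicas exist. The class name, the `need_fresh` flag, and the host labels are assumptions made for illustration; real connection objects would take the place of the strings.

```python
import itertools


class ReplicaRouter:
    """Send writes to the primary and spread reads across the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin over replicas

    def for_write(self):
        # All writes go to the single primary so they stay consistent.
        return self.primary

    def for_read(self, need_fresh=False):
        # Replicas may lag behind the primary; a read that must see the
        # latest write is sent to the primary instead.
        if need_fresh:
            return self.primary
        return next(self._replicas)


# Hypothetical host labels for one primary and two read replicas.
router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
```

The `need_fresh` escape hatch reflects the trade-off on the slide: small delays on reads are usually fine, but a few reads cannot tolerate replica lag.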
Read Replicas
- That worked well, and we decided to expand into new areas.
- Now we see that the Primary instance is not able to handle all writes, and write latency is growing.
- We also have unacceptable lag between the Primary and the READ replicas.
- What’s next?
Functional Decomposition
- The Locations table in our database is getting heavy WRITE traffic: the R:W ratio is 3:8.
- That table is used for location tracking and has nothing to do with the rest of the functionality.
- Why not decompose the functionality?
1. Move the Locations table to a new dedicated database.
2. Decouple location tracking into a stand-alone Microservice.
Functional Decomposition
- Done deal! It works!
- Now we want to add the rest of the country, and we must plan for scale.
- What can we do?
Data Partitioning or Sharding
Share Nothing Model
- It was decided to Shard the database.
- Sharding is a technique that splits data into smaller subsets and distributes them across a number of
physically separated database servers.
- Shards have no knowledge of each other.
- If one database shard has a hardware issue or goes through failover, no other shards are impacted.
- A query that reads or joins data from multiple database shards must be specially engineered.
- Example shards: North, Center, South
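A minimal sketch of region-based shard routing using the three shards named above. The host names and function names are hypothetical. The second function illustrates why cross-shard queries need special engineering: each shard is queried separately and the results are merged in the application.

```python
# Hypothetical mapping from region to the database server holding its shard.
SHARDS = {
    "North": "db-north.example.internal",
    "Center": "db-center.example.internal",
    "South": "db-south.example.internal",
}


def shard_for(region):
    """Return the database host that owns all data for the given region."""
    try:
        return SHARDS[region]
    except KeyError:
        raise ValueError(f"no shard configured for region {region!r}") from None


def count_orders_everywhere(count_on_shard):
    """Cross-shard query: ask every shard separately and merge the results.

    count_on_shard is a caller-supplied function that takes a host name and
    returns the order count stored on that shard.
    """
    return sum(count_on_shard(host) for host in SHARDS.values())
```

Because shards know nothing about each other, the routing table is the only place that has to change when a region is added or moved.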
Wrap Up and Questions
- Scalability Cube and Concepts
- Scaling Databases
1. Query Optimization, Indexing, Connection Pool
2. Vertical Scaling / Scaling Up
3. Read Replicas
4. Data Partitioning or Sharding
- Functional Decomposition
Horizontal Scaling
- We noticed that our Location API server is heavily loaded and sometimes crashes.
- We decided to run three Location API servers in three AWS availability zones and put them behind a load balancer.
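To illustrate what the load balancer does for the three servers, here is a minimal round-robin sketch that skips servers marked unhealthy. It is a toy model, not how a managed AWS load balancer is implemented, and the class and server names are assumptions.

```python
class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")
```

With one server per availability zone, losing a zone removes one entry from the rotation and the remaining two keep serving requests.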
Auto Scaling
- But we still face occasional issues during certain hours and we want to be smart with costs.
- Therefore, it was decided to use auto scaling.
Example:
- Minimum Size (application to be live): minimum 3 servers
- Desired Capacity (application to perform normally): desired 9 servers
- Maximum Size (during peak season): maximum 27 servers
- Scale out as needed (during peak load): scale out by 3 servers every 10 minutes
We can also consider reserved instances for cost savings.
It is very important to autoscale not only on hardware utilization (CPU, memory, etc.), but also on system metrics, for example lag in your messaging broker.
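The example policy can be sketched as a small decision function that is evaluated once per 10-minute window. The CPU and lag thresholds, the step-down behavior, and all names are assumptions for illustration; a real setup would express this through the cloud provider's scaling policies.

```python
MIN_SERVERS = 3       # application to be live
DESIRED_SERVERS = 9   # application to perform normally
MAX_SERVERS = 27      # peak season ceiling
STEP = 3              # servers added per 10-minute scaling step


def next_capacity(current, cpu_percent, queue_lag,
                  cpu_limit=70.0, lag_limit=1000):
    """Return the server count to run during the next 10-minute window.

    Scales out when either a hardware metric (CPU) or a system metric
    (message broker lag) is over its limit; the limits are assumed values.
    """
    overloaded = cpu_percent > cpu_limit or queue_lag > lag_limit
    if overloaded:
        target = current + STEP
    else:
        # Step back down gradually, but never below the desired capacity.
        target = max(DESIRED_SERVERS, current - STEP)
    return max(MIN_SERVERS, min(MAX_SERVERS, target))
```

Note that the second call in a usage such as `next_capacity(9, 20, 5000)` scales out even though CPU is low, because broker lag alone signals that the system is falling behind.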
Customer Attrition
- So far everything was good, but we noticed some customer attrition.
- We know that our pricing is personal, dynamic, and the cheapest in the market.
- Customers come as usual, fill in their details, ask for a price quote, and then go away...
- What could that be?
Customer Attrition
Reason
- It turns out that our unique and sophisticated pricing and matching algorithm includes an expensive “personality coefficient” calculation for each courier.
- We want the fastest and nicest couriers to be rewarded a bit more, so we look at their history to calculate the “personality coefficient” used in pricing.
Customer Attrition
Reason
Therefore, we decided to split the “personality coefficient” calculation out into a separate offline job and create a Pricing service.
- The calculation will run once every 24 hours for each courier.
- For fast lookups, we decided to use Redis as a key/value cache for storing the results.
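A minimal sketch of the split between the offline job and the online pricing path. To stay self-contained it uses a small in-memory class as a stand-in for Redis; the key format, the two-day expiry, the default coefficient, and all function names are assumptions.

```python
import time


class TTLCache:
    """In-memory stand-in for a key/value store such as Redis."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None  # missing or expired
        return entry[0]


DAY = 24 * 60 * 60


def run_offline_job(cache, courier_histories, compute):
    """Recompute every courier's coefficient; meant to run once per 24 hours."""
    for courier_id, history in courier_histories.items():
        # Assumed expiry of two days, so one missed run does not empty the cache.
        cache.set(f"coef:{courier_id}", compute(history), ttl_seconds=2 * DAY)


def quote_price(cache, courier_id, base_price, default_coef=1.0):
    """Online path: a cache lookup replaces the expensive calculation."""
    coef = cache.get(f"coef:{courier_id}")
    return base_price * (coef if coef is not None else default_coef)
```

The price quote no longer depends on how long the history-based calculation takes, only on a single key lookup.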
Three Important Response Time Limits
- Up to 0.1 seconds: The user perceives no delay.
- Up to 1 second: The delay is slightly perceptible. The user feels a pause, and the site may feel sluggish.
- Up to 10 seconds: The limit for holding the user’s attention. If an operation takes longer to complete, you’ll lose the user (unless you give them feedback).
Long Courier Matching
- One of the reasons people love Hermes Deliveries is our great user experience.
○ For example, we don’t require customer registration and collect only the minimum information at the time of the order (progressive data collection).
- Sometimes it takes more than 10 seconds to find the best matching courier.
○ As a result, some of our customers leave, even though we show a progress bar.
- How can we solve this?
Perceived Performance
Our Mission
- Requirement 1: Best courier matching
- Requirement 2: Do matching in less than 1 sec
- Rule: No trade-offs
Perceived Performance
Separation in Time Principle
- If a system or process must satisfy contradictory requirements, try to schedule its operation so that the conflicting requirements take effect at different times.
- Perceived performance refers to how quickly a software feature appears to perform its task.
Perceived Performance
Solution
- We can anticipate customer behavior based on history and on similar customers, and, knowing the customer’s location, we can match couriers in the background.
- We also know the “personality coefficient” that we calculated and cached, so we can calculate the price instantly.
- The moment customers click “order”, we already have everything needed to start the ordering process.
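A minimal sketch of separation in time: the slow matching starts in the background as soon as the location is known, so the result is usually ready when the customer clicks “order”. The `find_best_courier` body, the sleep that stands in for the slow search, and the class name are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def find_best_courier(location):
    """Placeholder for the slow matching step (hypothetical logic)."""
    time.sleep(0.2)  # stands in for a search that may take several seconds
    return f"courier-near-{location}"


class OrderSession:
    def __init__(self, location):
        # Start matching as soon as the location is known,
        # before the customer clicks "order".
        self._match = executor.submit(find_best_courier, location)

    def place_order(self, timeout=10):
        # If matching already finished, this returns immediately;
        # otherwise it waits only for the remaining time.
        return self._match.result(timeout=timeout)
```

Both requirements are kept: the matching is still the full, best-quality search, and the wait the customer perceives after clicking is only whatever part of it has not finished yet.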
Perceived Performance
Architecture (diagram): This is how we started
Money Movement
- So far we have been working with a 3rd-party payment processor, Plutos, which is responsible for charging customers and paying couriers:
1. During the ordering process we call the Plutos API, providing the customer’s credit card to charge and the Payee ID of the courier to pay.
2. All couriers are registered customers of Plutos and can withdraw money to one of their preferred payment channels.
3. All of that happens online.
Money Movement
Challenge
- Issues with customer credit cards:
○ wrong number, insufficient funds, etc.
- Plutos is not a reliable service, and we are losing money and customers because of that.
- What can we do?
Online vs. Offline or Sync vs. Async
- Online / Sync
1. Collect the customer’s credit card, then verify it and authorize funds by calling the Plutos API
2. If Plutos fails, fall back to another (maybe more expensive) provider for resiliency
3. Store the details for offline processing
4. Release the customer
- Offline / Async
5. Do the Money Movement using Plutos or another provider
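The five steps can be sketched as follows. Providers are modeled as plain functions, the queue is an in-memory stand-in for durable storage, and every name here is hypothetical. Only provider outages trigger the fallback; a card problem such as a wrong number would be a different error and should reach the customer instead of being retried elsewhere.

```python
from collections import deque


class ProviderDown(Exception):
    """Raised by a provider function when the provider is unavailable."""


settlement_queue = deque()  # stand-in for a durable queue or database table


def authorize(card, amount, providers):
    """Online steps 1-2: try each provider in order until one authorizes."""
    for provider in providers:
        try:
            return provider(card, amount)
        except ProviderDown:
            continue  # fall back to the next (maybe more expensive) provider
    raise ProviderDown("all providers failed")


def place_order(card, amount, courier_id, providers):
    auth_id = authorize(card, amount, providers)            # steps 1-2
    settlement_queue.append((auth_id, amount, courier_id))  # step 3
    return "order accepted"                                 # step 4


def settle_pending(move_money):
    """Offline step 5: drain the queue and move the money."""
    settled = 0
    while settlement_queue:
        move_money(*settlement_queue.popleft())
        settled += 1
    return settled
```

The customer is released as soon as funds are authorized, while the actual money movement can be retried offline without anyone waiting on it.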
Wrap Up and Questions
- Horizontal Scaling
- Auto Scaling
- Functional Decomposition
- Offline Jobs
- Three Important Response Time Limits
- Perceived Performance
- Async vs. Sync
Shared Responsibility Principle
While AWS (or any Cloud vendor) manages security of the Cloud, security in the
Cloud is the responsibility of the customer
Blast Radius
How widespread is the threat or failure?
Goal: minimize blast radius
Tactics
- Multiple Cloud account security strategy
○ Accounts per organization, department, product, etc.
- Segmentation of network, data, storage etc.
- Access Control policies and least privilege principle
Tactics
- Data Classification and Handling strategy and standard
○ Data can be public, restricted, sensitive, secret etc.
○ For each type of data, define a standard for how to handle it at rest, in flight, on screen, etc.
- “Dance Like Nobody’s Watching. Encrypt Like Everyone Is.”
- Keys and Secret Management
○ Multiple Keys – preferably key per customer, user, record, column etc.
○ Creation, Rotation etc.
○ Secret Vault
Tactics
- Security Reviews and mindset in teams
- Automations
○ Find secrets or sensitive data in GitHub, logs, data stores, and customer free-text entries such as “comments” or “descriptions”
○ S3 buckets open to the public
○ Policy violations
○ Application scans as part of CI/CD, such as OWASP Top 10 risks and 3rd-party and open-source dependency scans
○ Docker image vulnerabilities
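As an illustration of the first automation, here is a minimal sketch of a regex-based scan over free text. The three patterns are simplified assumptions chosen for the example; a production scanner would use a maintained rule set and validate matches (for instance with a card checksum) to reduce false positives.

```python
import re

# Simplified, illustrative patterns; not a complete rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "card_like_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan(text):
    """Return the sorted names of all patterns that match anywhere in text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))
```

The same function could be run over commit diffs, log lines, or a “comments” field before it is stored, with any hit routed to review.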
Tactics
- Have a security incident response plan
- Don’t put all your eggs in one basket
- Don’t trust anything or anyone
Incidents are unavoidable, so you’d better have a proper response plan and data
Incident Response Practices
- Have an on-call procedure and discipline
- Have a monitoring and prioritized alerting system
- Escalate and declare incidents early and often
- During an incident, identify the potential root cause and bring in the required experts as needed
- Assess customer impact and communicate
- “Stop the bleeding” first
- Preserve everything you might need for post-mortem root cause analysis, and track times
- Define next steps and conduct a post-mortem root cause analysis