How We Scaled Freshdesk to
Handle 150M Requests/Week
Kiran Darisi
Director, Technical Operations at Freshdesk
Our customer base grew by 400% and the number of requests
per week boomed from 10 million to 65 million in a year (2013).
Cool for a 3-year-old startup?
Not from an engineering perspective.
We used a bunch of methods to scale vertically in a really
short amount of time.
Sure, we eventually had to shard our databases, but some
of these techniques helped us stay afloat for quite a while.
MOORE’S WAY
Increasing the RAM, CPU and I/O

We upgraded from an Amazon EC2 First Generation Medium
instance to High Memory Quadruple Extra Large, increasing
our RAM from 3.75 GB to 64 GB.

But the RAM and CPU cycles we added did not correlate with
the workload we got out of the instance, so we stayed put
at 64 GB.
THE READ/WRITE SPLIT
Using MySQL replication and distributing the reads between
master and slaves

The R/W split increased the number of I/Os performed on our
databases, but it didn’t do much for write performance.

We assigned a dedicated role to each slave, because using a
round-robin algorithm to select a slave for each query
proved ineffective.
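To make the dedicated-role idea concrete, here is a minimal Ruby sketch of picking a slave by the kind of query instead of by round robin; the roles and hostnames are made up for illustration, not our production setup:

    # Each slave gets a fixed role, so heavy report queries never land on the
    # slave that serves dashboard reads.
    SLAVE_POOLS = {
      dashboard: { host: "slave-dashboard.internal", port: 3306 },
      reports:   { host: "slave-reports.internal",   port: 3306 },
      search:    { host: "slave-search.internal",    port: 3306 }
    }

    def slave_for(role)
      SLAVE_POOLS.fetch(role) { SLAVE_POOLS[:dashboard] }
    end

    # Route an expensive reporting query to its dedicated slave.
    reports_db = slave_for(:reports)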
MYSQL PARTITIONING
Using the MySQL 5 built-in partitioning capability

We chose the partition key and the number of partitions,
and the table was partitioned automatically.

Post-partitioning, our read performance increased
dramatically, but the write performance was still a problem.
Things to keep in mind while performing MySQL partitioning:
1. Choose the partition key carefully, or alter the current schema to
follow MySQL’s partitioning rules (a sketch follows this list).
2. The number of partitions you start with directly affects the I/O
operations on the disk.
3. With hash-based partitioning, you cannot control which partition a
tenant lands in. This means you’ll be in trouble if two or more noisy
customers fall within the same partition.
4. Make sure that every query contains the partition key. A query
without the partition key ends up scanning all the partitions, and
performance is sure to take a dive.
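As an illustration of points 1 and 2, a Rails-style migration for hash partitioning might look like the sketch below; the table name, column name and partition count (tickets, account_id, 128) are assumptions, not Freshdesk’s actual schema:

    class PartitionTickets < ActiveRecord::Migration
      def up
        # MySQL requires the partition key to be part of every unique key,
        # so the primary key is widened to include it first.
        execute "ALTER TABLE tickets DROP PRIMARY KEY, ADD PRIMARY KEY (id, account_id)"

        # Hash partitioning spreads tenants over a fixed number of partitions;
        # every query must filter on account_id or it will scan all of them.
        execute "ALTER TABLE tickets PARTITION BY HASH(account_id) PARTITIONS 128"
      end
    end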
CACHING
Caching objects that rarely change in their lifetime

We cached ActiveRecord objects as well as HTML partials
(bits and pieces of HTML) using Memcached.

We chose Memcached because it scales well across multiple
clusters. The Memcached client you use makes a big
difference in response time, so we went with dalli.
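A minimal sketch of the object-caching side with dalli; the server address, namespace and TTL below are placeholders, not our production settings:

    require "dalli"

    cache = Dalli::Client.new("memcache1.internal:11211",
                              namespace: "helpdesk", expires_in: 300)

    account_id = 42                        # example tenant
    account = cache.get("account/#{account_id}")
    if account.nil?
      account = Account.find(account_id)   # hit ActiveRecord only on a cache miss
      cache.set("account/#{account_id}", account)
    end

The HTML partials mentioned above fit the same pattern, typically via Rails fragment caching with Memcached configured as the cache store.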
DISTRIBUTED FUNCTIONS
Keeping response time low by
using different storage engines for
different purposes
We started using Amazon Redshift for
analytics and data mining, and Redis to
store state information and background
jobs for Resque.
But because Redis can’t scale out or fail over,
we don’t use it for atomic operations.
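For the Resque side, the wiring is roughly as follows; the Redis host and the job class are illustrative:

    require "redis"
    require "resque"

    # Redis holds the queues and job payloads that Resque workers consume.
    Resque.redis = Redis.new(host: "redis.internal", port: 6379)

    class TicketNotificationJob
      @queue = :notifications

      def self.perform(ticket_id)
        # look up the ticket and send the notification
      end
    end

    # Enqueued from the web tier; a background worker picks it up later.
    Resque.enqueue(TicketNotificationJob, 1042)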
But scaling vertically can only get you so far.
We decided that scaling horizontally by sharding was
the only cost-effective way to increase write scalability
beyond the instance size.
Two main concerns we had before we took the final call
on sharding:
1. No distributed transactions – We wanted all tenant
details to be in one shard.
2. Rebalancing the shards should be easy – We wanted
control over which tenant sits in which shard and to
be able to move them around when needed.
A little research showed us that directory-based
sharding was the only way to go.
REASONS FOR CHOOSING DIRECTORY-BASED SHARDING
It is simpler than hash key-based
or range-based sharding.
Rebalancing shards is easier here
than in other methods.
A typical directory entry looks like this:

tenant_info          shard_details   shard_status
Stark Industries     shard1          Read & Write

• tenant_info - unique key referring to the DB entry
• shard_details - the shard in which that tenant exists
• shard_status - the kind of activity the tenant is ready for (we have
multiple shard statuses like Not Ready, Only Reads, Read & Write etc)
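In code, the directory can be as simple as a single table fronted by a small model. A hedged sketch (the model and method names below are illustrative, not Freshdesk’s actual code):

    # Maps a tenant to the shard that owns its data.
    # Columns mirror the entry above: tenant_info, shard_details, shard_status.
    class ShardDirectory < ActiveRecord::Base
    end

    def shard_for(tenant_info)
      entry = ShardDirectory.where(tenant_info: tenant_info).first
      raise "unknown tenant #{tenant_info}" if entry.nil?
      raise "tenant not ready yet" if entry.shard_status == "Not Ready"
      entry.shard_details   # e.g. "shard1"
    end

    shard_for("Stark Industries")   # => "shard1"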
How directory lookups work

The API wrapper is tuned to accept tenant information in
multiple forms, like the tenant URL, tenant ID etc.

The sharding API even acts as a unique ID generator, so
that the tenant IDs it generates are unique across shards.
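A sketch of what that wrapper and ID generator might look like, assuming the directory also stores tenant_id and tenant_url columns and using a plain MySQL sequence row; all of this is illustrative, not Freshdesk’s actual implementation:

    # Accept a tenant in whatever form the caller has handy.
    def directory_entry_for(tenant)
      if tenant.is_a?(Integer)
        ShardDirectory.where(tenant_id: tenant).first    # lookup by tenant ID
      else
        ShardDirectory.where(tenant_url: tenant).first   # lookup by tenant URL
      end
    end

    # One counter row in the directory database hands out tenant IDs,
    # so IDs never collide across shards (MySQL LAST_INSERT_ID trick).
    def next_tenant_id
      conn = ShardDirectory.connection
      conn.execute("UPDATE tenant_sequence SET id = LAST_INSERT_ID(id + 1)")
      conn.select_value("SELECT LAST_INSERT_ID()").to_i
    end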
Why we care about rebalancing

Sometimes a customer grows from processing 1,000 tickets per day
to 10,000 tickets per day. This affects the performance of the
whole shard.

We can’t solve this by splitting the customer’s data across multiple
shards, because we didn’t want the mess of distributed transactions.

So, in these cases, we’d move the noisy customer to a shard of its
own. That way, everybody wins.
Steps to Rebalance a Shard

1. Every shard has its own slave to scale the reads. For example, say
Wayne Enterprises and Stark Industries are in shard1. The directory
entries look like this:

   Wayne Enterprises    shard1    Read & Write
   Stark Industries     shard1    Read & Write

2. If Wayne Enterprises grows at a breakneck pace, we would decide to
move it to another shard (averting the danger of Bruce Wayne and Tony
Stark being mad at us at the same time).

3. So we would boot up a new slave to shard1 and call it shard2. Then
we’d attach a read replica to the new slave and wait for it to sync
with the master.

4. We would then stop the writes for Wayne Enterprises by changing its
shard status in the directory:

   Wayne Enterprises    shard1    Read Only
   Stark Industries     shard1    Read & Write

5. Then we would stop the replication of master data into shard2 and
promote it to master, and the directory entry is updated accordingly:

   Wayne Enterprises    shard2    Read & Write
   Stark Industries     shard1    Read & Write

6. This effectively moves Wayne Enterprises to its own shard. Batman is
happy, and so is Iron Man. (The directory updates in steps 4 and 5 are
sketched after this list.)
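The directory changes in steps 4 and 5 boil down to two small updates. A sketch using the hypothetical ShardDirectory model from earlier:

    wayne = ShardDirectory.where(tenant_info: "Wayne Enterprises").first

    # Step 4: stop writes for the tenant while shard2 catches up.
    wayne.update_attributes(shard_status: "Read Only")

    # Step 5: after promoting shard2 to master, cut the tenant over.
    wayne.update_attributes(shard_details: "shard2",
                            shard_status:  "Read & Write")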
Word of caution

1. Don’t do it unless it’s absolutely necessary. You will have to
rewrite code for your whole app, and maintain it.
2. You could use functional partitioning (moving an oversized table
to another DB altogether) to avoid sharding completely if writes
are not a problem.
3. Choosing the right sharding algorithm is a bit tricky, as each has
its own benefits and drawbacks. You need to make a thorough
study of all your requirements while picking one.
4. You will have to take care of unique ID generation across shards.
What’s next for Freshdesk

We get 250,000 tickets across Freshdesk every day and 100 million
queries during the same time (with a peak of 3-4k QPS). We now have a
separate shard for all new sign-ups, and each shard can carry roughly
20,000 tenants.

In the future, we’d like to explore Multi-pod architecture and also
look at Proxy architecture using MySQL Fabric, Scalebase etc.
“Behind every slideshare is a great blogpost”

Read more about scaling Freshdesk here:
http://blog.freshdesk.com/how-freshdesk-scaled-using-sharding/
