The topics we will be covering in this talk:
1. Introduction - Brief business context to appreciate the need to solve this problem, and the challenges involved.
2. The factors driving the decision to choose Cosmos DB as our backend store.
3. Key insights into what drives the cost of the store, and various gotchas involved when designing such a system.
4. How to optimize the cost and bring intelligence to enable auto-scalability.
5. The need for building multi-version concurrency control, and how to achieve it to enable parallel writes with multiple schema versions for the same record.
6. The tradeoff between readability and storage cost, and how to get the best of both worlds by using in-flight abbreviated compression.
8. On Prem vs Cloud Native
● Theoretical caps on scale vs. no such caps
● Scalable in steps vs. granular scalability
● Cost of maintenance and upgrades (man hours and infra) vs. amortized cost
● In-house skill set vs. vendor customer care
● Custom buildable vs. not as much flexibility
9. Rationale
● Cloud native
○ To mitigate the limits of scaling business
● InMobi - Microsoft Partnership
● Top contenders
○ Cosmos DB, Aerospike
10. Cosmos DB
● Cloud native alternative to Cassandra/Aerospike/MongoDB
● Azure equivalent of AWS DynamoDB, with a few extra bells and whistles
● Document store supporting point lookups and queries (sketched below)
○ Fetch a document given its unique ID
○ SQL-type queries spanning across documents
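For concreteness, the two access patterns look roughly like this with the azure-cosmos Python SDK; the account, database, and container names below are placeholders, not our production setup:

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("profiles").get_container_client("users")

# Point lookup: fetch a document given its unique ID (and partition key).
doc = container.read_item(item="user-123", partition_key="user-123")

# SQL-type query spanning documents in the container.
results = list(container.query_items(
    query="SELECT * FROM c WHERE c.country = @country",
    parameters=[{"name": "@country", "value": "IN"}],
    enable_cross_partition_query=True,
))
```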
17. Data Splits
● Extending the observations on the cost of reads and writes, data can be split in unique ways to enable various use cases, based on access patterns
● User timelines
○ Date boundaries
○ Container Splitting
● User profiles
○ Apps owned
18. Data modelling
● The InMobi team determines the primary and secondary key, based on the read-write pattern and distribution of records.
● Apart from the read-write pattern, we also consider
○ immutability of records,
○ record size,
○ item-level TTL.
19. Data modelling
Useful models at InMobi:
● User level aggregates - for aggregate profiles
● Time based partitioning - for immutable timelines
● Top-level keys per feature in the document, which give flexibility for design changes (see the sketch below)
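A sketch of what such a document might look like. The field names are illustrative; "ttl" is Cosmos DB's item-level TTL property (in seconds), which takes effect only when TTL is enabled on the container:

```python
# Illustrative user-level aggregate document with top-level keys per feature.
profile = {
    "id": "user-123",
    "partitionKey": "user-123",
    # Separate top-level keys per feature leave room for later design changes:
    "appsOwned": ["com.example.game", "com.example.news"],
    "aggregates": {"clicks7d": 42, "impressions7d": 810},
    "ttl": 30 * 24 * 3600,  # item-level TTL: expire this record after 30 days
}
```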
25. RU = f(docSize, partitions)
● Read RUs are directly proportional to document size, regardless of increase in partition count and collection size
● RUs consumed for fetching a non-existent document are constant, regardless of increase in partition count and collection size
● As the collection size grows, though query costs remain constant, the minimum provisioning keeps growing
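These costs can be observed empirically: Cosmos DB returns the request charge on every response. A minimal sketch with the Python SDK, continuing the placeholder `container` from the earlier snippet and assuming the SDK's `last_response_headers` pattern:

```python
# Read a document, then inspect the RU charge the service reported for it.
container.read_item(item="user-123", partition_key="user-123")
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"Point read cost: {charge} RU")  # grows with doc size, not collection size
```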
29. Cost of Degradation
“HTTP Status Code 429: The user has sent too many requests in a given amount of time (rate limiting).”
30. Cost of Degradation
If provisioned for 100 read calls/sec (assume 100 RU) and bombarded with, say,
1000 read calls/sec, we will encounter HTTP 429 errors.
The observed behaviour is that fewer than 100 calls succeed, because the failed
calls also consume resources.
31. Cost of Degradation
● Recommendation: honor the backoff (see the sketch below)
● Corner case: in serving systems, such a backoff cannot be honored, leaving scaling out as the only solution.
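Where the backoff can be honored, the 429 response tells clients how long to wait. A minimal sketch of an explicit retry loop with the Python SDK (the SDK can also retry 429s automatically; this shows the mechanics):

```python
import time

from azure.cosmos import exceptions

def read_with_backoff(container, doc_id, pk, max_attempts=5):
    """Point read that honors the server's backoff hint on HTTP 429."""
    for attempt in range(max_attempts):
        try:
            return container.read_item(item=doc_id, partition_key=pk)
        except exceptions.CosmosHttpResponseError as e:
            if e.status_code != 429 or attempt == max_attempts - 1:
                raise
            # The 429 response carries the suggested wait before retrying.
            wait_ms = int(e.headers.get("x-ms-retry-after-ms", 1000))
            time.sleep(wait_ms / 1000.0)
```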
42. Summary
● On Prem vs Cloud Native
● Levers to optimize cost and performance
○ Size of documents
○ Data models to enable document splitting
○ Autoscaler
○ Data compression
Future looking
● Partial document updates
● AI to tune the autoscaler and handle burst traffic
● Enable multi-region writes
43. We would love to learn about your typical use cases of Cosmos DB.
How do you approach costs in your system?
The major business for InMobi is mobile advertising: showing the right advertisement to the right user at the right opportunity.
The flow of information is from right to left, and the flow of money is from left to right.
Publishers are the apps a user is engaging with.
An SSP (supply-side platform) aggregates ad requests from various apps.
Exchanges aggregate requests from various SSPs.
A DSP (demand-side platform) listens to multiple exchanges and, using intelligence powered by its DMP (data management platform), responds with the best possible ad for that user at that point in time.
The overall latency of this system is typically within 50ms.
Consequently, the latency budget for the DMP is under 10ms.
The rate of ad requests received directly impacts the money spent and varies hugely based on the time of day, the day of the week, and various other seasonal and geographical factors, e.g. the Cricket World Cup causes an increase in content consumption by users.
The number of ads served drives the revenue generated and is dependent on the advertising funds available and not on the number of requests received.
InMobi operates at a global scale, so it needs to serve ads across various geographies while adhering to latency requirements. A huge global presence implies that traffic patterns vary widely across geographies, e.g. the US and India differ hugely in peak traffic times, seasonal behaviours like Diwali or Christmas, types of apps used, etc.
An ad request from a user can land on any of InMobi's geo-redundant serving systems. We need the user's information to be consistently available across all serving regions, to serve ads uniformly. Global consistency also enables easy fallbacks and load balancing, and accounts for the fact that users travel. This could be achieved with either a master-master or a master-slave setup; either way, a globally consistent view of a user's information is critical.
Upgrades - security, performance, features, etc.
A common scenario: someone sets up a system and is no longer working on it six months down the line, when it breaks and adds to the pain of maintenance.
Why Cosmos DB?
The story behind this choice:
Redis and Cassandra vs Aerospike - Aerospike won on reliability, scalability, etc.
In-house expertise with Aerospike.
Cosmos DB vs DynamoDB - due to the partnership, it makes more financial sense to choose Cosmos DB.
Note that this is not an exhaustive list of Cosmos DB's capabilities, but the major ones for our use cases.
The first question we ask ourselves is where and how the data is stored.
The logical next step is: what is the cost of this system, and how is it measured?
Introduce Request Units (RUs): a single number denoting the CPU/memory/IO utilization on Cosmos DB.
Measured per read/write/query, etc.
Now that we know about documents and RUs, how do they relate?
Did you observe the drastic changes in RU utilization at 16, 32, and 64 KB?
A 32 KB document split into two 16 KB documents - if every write on the big document translates to two small writes, the total cost does not change.
A 24 KB document split into a 16 KB + 8 KB pair - if every write on the big document translates to two writes on the smaller documents, the total cost after the split is higher, since the cost of a 24 KB write equals the cost of a 16 KB write.
If each write on a big document translates to only one write on a smaller document, then splitting a 32 KB document halves your write cost straight away.
Whereas for a 24 KB document, the cost benefit is not as significant even when doing a single write on one of the smaller documents.
In contrast, a case where we see significant benefit is when working with large documents: reading multiple split documents is cheaper than reading one big document under certain conditions. If a 40 KB document is split into 4 x 10 KB documents, all with the same partition key, the total read and write RU falls significantly.
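The arithmetic behind the split decision can be sketched under the stepped-cost behaviour observed above; the RU numbers below are illustrative placeholders, not published Cosmos DB prices:

```python
# Hypothetical RU step function: the write charge is flat within a size band
# and jumps at the 16/32/64 KB boundaries, as observed above.
def write_cost_ru(size_kb):
    if size_kb < 16:
        return 5
    if size_kb < 32:
        return 10
    if size_kb < 64:
        return 20
    return 40

# 32 KB split into 2 x 16 KB: unchanged if every write touches both halves,
# but a write touching only one half costs half as much.
assert write_cost_ru(32) == 2 * write_cost_ru(16)
# 24 KB split into 16 KB + 8 KB: cost(24) == cost(16), so writing both
# fragments costs more than the single unsplit write.
assert write_cost_ru(24) == write_cost_ru(16)
assert write_cost_ru(16) + write_cost_ru(8) > write_cost_ru(24)
```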
So the main takeaway is: keep document sizes small.
This may result in bigger collections, which can in turn be split into smaller collections.
Indexing is required to enable _ttl and queries.
With full indexing enabled, it is possible to run Spark jobs using Cosmos DB as the backend DB.
Indexing increases write costs, while read costs remain stable.
Avoid indexing indiscriminately, unless a cost-benefit analysis has been done on running queries against Cosmos DB.
Alternatives include running batch jobs on dumps of the data.
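A sketch of an opt-in indexing policy along these lines: index only the paths that known queries filter on, exclude the rest to keep write costs down, and enable item-level TTL. The account, container, and path names are illustrative:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("profiles")

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/country/?"}],  # the one field our queries filter on
    "excludedPaths": [{"path": "/*"}],          # skip indexing everything else
}
database.create_container(
    id="users",
    partition_key=PartitionKey(path="/partitionKey"),
    indexing_policy=indexing_policy,
    default_ttl=-1,  # enable per-item TTL without a container-wide default
)
```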
Understanding costs is nice, but how does my system scale? What are the building blocks?
Horizontal scalability is achieved using partitioning and replication.
Given what we know about documents, cost, and scaling factors, how do these interplay?
The expected behaviour might be that, within a second, the first 100 read calls succeed and the remaining 900 fail. This would be the case if those 900 failed calls consumed no resources. But with "Too many requests" (HTTP error code 429), the failed read calls also consume a constant amount of resources, which eats into your 100 successful calls. So the behaviour is not 100 successes and 900 failures: all 1000 calls consume resources, a few succeed and the rest fail, implying fewer than 100 successful responses.
[Image: skewed degradation]
Cosmos DB supports scaling of provisioned RUs through the API and also via the portal.
Automating this scaling is a critical feature for optimizing cost.
Depending on the variation in the usage pattern of Cosmos DB RUs, the auto-scaling feature can make or break the whole system being built.
At InMobi, the traffic observed across days depends on various factors, as explored earlier; if we provisioned our systems for the maximum required, the Cosmos DB cost would become too large to build a sustainable business model.
A decrease in RU consumption on throttling - consumption drops because clients honour the backoff retry logic, yet this is exactly when we have to increase the RUs.
Handling hot partitions on throttling - contrary to the previous point, some hot partitions getting throttled might push the auto-scaler to increase RUs more than needed.
Interestingly, even though the overall provisioning is sufficient, we will see a few 429s in the system. This might seem similar to situation #2 at first glance, but simply increasing the RUs in such a scenario massively overprovisions the system, bloating the cost significantly.
Before increasing RUs, the algorithm needs to be cognizant of request skewness.
Cost model of Cosmos DB for RU consumption: as mentioned earlier, RUs are charged per hour at the maximum RU allocated during that hour. Do we really need to decrease the RUs if we are anyway paying for the maximum allocated during the hourly interval?
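Putting these caveats together, the autoscaler's decision step might look like the following sketch. The inputs, thresholds, and scaling factors are hypothetical, and the resulting value would be applied via the scaling API:

```python
MIN_RUS = 400  # Cosmos DB's floor for provisioned throughput

def next_provisioned_rus(current_rus, consumed_rus, saw_429s, skew_detected):
    """One decision step of a hypothetical RU autoscaler."""
    if saw_429s:
        if skew_detected:
            # Hot partition: raising total RUs over-provisions everything else;
            # the fix is in the partition key, not the throughput.
            return current_rus
        # Backoff suppresses measured consumption during throttling, so scale
        # up on the 429 signal itself rather than on consumed RUs.
        return int(current_rus * 1.5)
    if consumed_rus < 0.5 * current_rus:
        # Billing is hourly at the maximum allocated, so a decrease only pays
        # off from the next hour onward; it is still worth requesting.
        return max(int(consumed_rus * 1.2), MIN_RUS)
    return current_rus
```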
When building a user profile, disparate information is ingested from various sources.
A summarization of this information is stored in Cosmos DB.
Information coming from various sources will not have the same schema.
User profiles also evolve over time.
Concurrent modification of records is prone to data loss due to race conditions.
We perform partial serialization and deserialization at the client end to handle corner cases.
Records are serialized in Cosmos DB as JSON objects.
During deserialization, the respective schema of the event is fetched from a schema registry.
The registry book-keeps the schema mapping (Avro schemas in our use case) for every user feature, and the respective record fragments (the fragment of the record with the user feature we want to update) are deserialized.
Any operations applicable to the fragment are performed before serializing back to JSON and writing to Cosmos DB.
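One way to guard this read-modify-write path against the race condition mentioned above is optimistic concurrency on the document's _etag. A minimal sketch with the Python SDK, where fetch_schema and apply_update are hypothetical stand-ins for the schema-registry lookup and the fragment merge, not necessarily InMobi's exact mechanism:

```python
from azure.core import MatchConditions

# Hypothetical stand-ins for the registry lookup and feature-level merge;
# the real versions resolve an Avro schema per user feature.
def fetch_schema(feature, version):
    return {"feature": feature, "version": version}

def apply_update(schema, fragment, event):
    merged = dict(fragment)
    merged.update(event)
    return merged

def update_fragment(container, doc_id, partition_key, feature, event):
    # Read the full JSON document, but deserialize/modify only one fragment.
    doc = container.read_item(item=doc_id, partition_key=partition_key)
    schema = fetch_schema(feature, doc[feature].get("schemaVersion"))
    doc[feature] = apply_update(schema, doc[feature], event)
    # Conditional replace guarded by the document's _etag: if a concurrent
    # writer changed the record, this fails (HTTP 412) instead of losing data.
    container.replace_item(
        item=doc["id"],
        body=doc,
        etag=doc["_etag"],
        match_condition=MatchConditions.IfNotModified,
    )
```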