Modeling data and best practices for the Azure Cosmos DB.

Modeling data and best practices
for the Azure Cosmos DB
Mohammad Asif Waquar
@asifwaquar

about me
Senior Software Engineer at ABN AMRO
https://www.linkedin.com/in/mohammad-asif-6a6153111/

SQL PASS Chapter Team
@arrnagaraj
@Sachit_Keshari
@SanjivVenkatram
@sarbjitgill
@aaroh_bits
@Pioisms

Agenda
Intro Cosmos DB
Resource Model
Data Modelling Strategy & Partitioning
Demo SQL API

Turnkey global distribution
Elastic scale out of storage & throughput
Comprehensive SLAs
Guaranteed low latency at the 99th percentile
Five well-defined consistency models
Azure Cosmos DB
A globally distributed, massively scalable, multi-model database service

Elastic scale out
Comprehensive SLAs
Multi-model: Key-value, Column-family, Document, Graph
APIs: Table API, Cosmos DB’s API for MongoDB
Azure Cosmos DB

Features
• Multi-model data paradigm: key-value, document, graph, column-family;
• Low latency for 99% of queries: less than 10 ms for read operations and less than 15 ms for (indexed) write operations;
• Designed for high throughput;
• SLAs covering availability, data consistency, and latency, with availability up to 99.999%;
• Configurable throughput;
• Automatic replication (master-slave);
• Automatic data indexing;
• Configurable levels of data consistency. Five levels: Strong, Bounded Staleness, Session, Consistent Prefix, Eventual.

CONTAINERS
Logical resources “surfaced” to APIs as tables, collections, or graphs, which are made up of one or more physical partitions or servers.
RESOURCE PARTITIONS
• Consistent, highly available, and resource-governed coordination primitives
• Consist of replica sets, with each replica hosting an instance of the database engine
Diagram labels: Tenants; Containers (Collections, Tables, Graphs); Resource Partitions; Replica Set (Leader, Follower, Follower, Forwarder); To remote resource partition(s)
Resource Hierarchy

Account
Database
Container
Item
Account URI and Credentials
********.azure.com
pass…

Account
Creating Account

Account
Database Representations

Account
Container = Collection, Graph, or Table
Container Representations

Account
Item = Document (in a Collection), Vertices/Edges (in a Graph), or Row (in a Table)
Item Representations

Account
Item, Conflict, Stored procedure, Trigger, UDF
Container-Level Resources

Data Modelling Strategy & Partitioning

Ways to Model Your Data
Normalize everything
Embed as 1 piece

Data Modelling: Relational vs. Document
User Table
UserID  Name        Dob
1       John Smith  8/30/1964

Holdings Table
StockID  UserID  Qty  Symbol
1        1       100  MSFT
2        1       75   WMT

Document
{
  "id": 1,
  "name": "John Smith",
  "dob": "1964-08-30",
  "holdings": [
    { "qty": 100, "symbol": "MSFT" },
    { "qty": 75, "symbol": "WMT" }
  ]
}

Relational Store        Document Store
Rows                    Documents
Columns                 Properties
Strongly-typed schemas  Schema-free
Highly normalized       Typically denormalized
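
The mapping from the two relational tables to the single document can be sketched in a few lines of Python. This is only an illustration of the denormalization step: the `to_document` helper is hypothetical, and the date is written in ISO form (YYYY-MM-DD) as an assumption about how it would be stored.

```python
import json

# Rows as they would come out of the User and Holdings tables on the slide.
users = [{"UserID": 1, "Name": "John Smith", "Dob": "1964-08-30"}]
holdings = [
    {"StockID": 1, "UserID": 1, "Qty": 100, "Symbol": "MSFT"},
    {"StockID": 2, "UserID": 1, "Qty": 75, "Symbol": "WMT"},
]


def to_document(user, all_holdings):
    """Fold a user's holdings rows into one denormalized document."""
    return {
        "id": user["UserID"],
        "name": user["Name"],
        "dob": user["Dob"],
        "holdings": [
            {"qty": h["Qty"], "symbol": h["Symbol"]}
            for h in all_holdings
            if h["UserID"] == user["UserID"]
        ],
    }


document = to_document(users[0], holdings)
print(json.dumps(document, indent=2))
```

The join key (UserID) disappears from the holdings because the relationship is now expressed by nesting rather than by a foreign key.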

Modelling challenges
• How to denormalize?
• How to normalize?
• To embed or reference?
• Can I apply joins?
• Should different data types go in the same collection, or in different ones?

Modelling challenges: To embed or reference?

Embed
Document
{
  "id": 1,
  "name": "John Smith",
  "dob": "1964-08-30",
  "holdings": [
    { "qty": 100, "symbol": "MSFT" },
    { "qty": 75, "symbol": "WMT" }
  ]
}
Document
{
  "postid": "1",
  "title": "My blog post",
  "body": "Post content…",
  "comments": [
    "comment #1",
    "comment #2",
    "comment #3",
    "comment #4",
    :
    "comment #1598873",
    :
  ]
}

Reference
Document
{
  "postid": "1",
  "title": "My blog post",
  "body": "Post content…"
}
Document (one per comment)
{
  "postid": "1",
  "comment": "comment #3"
}

When to embed?
o Data that is queried together should live together.
o Child data is dependent on the parent.
o 1:1 relationships, e.g. every customer has one email, phone number, and NRIC number.
o Data that doesn’t change frequently, e.g. email and address rarely change.
o Embedding usually gives better read performance at the cost of write performance, so it suits workloads that are not write-heavy.

When to reference?
o 1 : many (unbounded relationship)
o many : many relationships
o Data changes at different rates
o What is referenced, is heavily referenced by many others
o Typically provides better write performance
o But may require more network calls for reads
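
As a rough illustration of the trade-off, the blog post example can be moved from the embedded shape to the referenced shape. The `split_post` helper below is hypothetical and works on plain Python dictionaries; it is a sketch of the modelling idea, not of any Cosmos DB API.

```python
def split_post(post):
    """Turn one post with embedded comments into a post document plus
    one small document per comment, each pointing back via "postid"."""
    post_doc = {k: v for k, v in post.items() if k != "comments"}
    comment_docs = [
        {"postid": post["postid"], "comment": text}
        for text in post.get("comments", [])
    ]
    return post_doc, comment_docs


embedded = {
    "postid": "1",
    "title": "My blog post",
    "body": "Post content",
    "comments": ["comment #1", "comment #2", "comment #3"],
}

post_doc, comment_docs = split_post(embedded)

# Adding a comment in the referenced model writes one small document
# instead of rewriting the whole, ever-growing post document.
comment_docs.append({"postid": "1", "comment": "comment #4"})

# Reading the full post now needs two lookups: the post and its comments.
print(post_doc)
print([c["comment"] for c in comment_docs if c["postid"] == post_doc["postid"]])
```

The write touches a single small document, while the read needs an extra lookup, which is the better-writes, more-network-calls trade-off listed above.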

Why is choice of partition key so important?
o Enables your data in Cosmos DB to scale
o Large impact on performance of system
What can go wrong?
o Hot partitions
o A poor choice forces many cross-partition queries for the workload
Partitioning
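
One way to spot a likely hot partition before committing to a key is to check how unevenly a sample of items spreads over the candidate key's values. The sketch below assumes a hypothetical sample of order items; the field names `userId` and `country` and the `largest_share` helper are illustrative only.

```python
from collections import Counter

# Hypothetical sample of items; field names are illustrative only.
orders = [
    {"id": "1", "userId": "john", "country": "NL"},
    {"id": "2", "userId": "karen", "country": "NL"},
    {"id": "3", "userId": "bob", "country": "NL"},
    {"id": "4", "userId": "milton", "country": "NL"},
    {"id": "5", "userId": "sukhi", "country": "SG"},
    {"id": "6", "userId": "john", "country": "NL"},
]


def largest_share(items, key):
    """Fraction of items that land in the busiest logical partition."""
    counts = Counter(item[key] for item in items)
    return max(counts.values()) / len(items)


for candidate in ("userId", "country"):
    print(candidate, round(largest_share(orders, candidate), 2))
```

In this sample, `country` puts five of the six items into one logical partition, while `userId` spreads them far more evenly, so `userId` is the less skewed candidate.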

Logical partition: Stores all data associated with the same partition key value
Physical partition: Fixed amount of reserved SSD-backed storage + compute.
Cosmos DB distributes logical partitions among a smaller number of physical partitions.
From your perspective: define 1 partition key per container
Partitioning

Partition Key: User Id
Logical Partitioning Abstraction
Behind the Scenes:
Physical Partition Sets
hash(User Id)
Pseudo-random distribution of data over the range of possible hashed values
Cosmos DB Container (e.g. Collection)
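
The idea of hashing the partition key into a range and mapping ranges to physical partitions can be sketched as follows. The hash function the service actually uses is internal, so SHA-256, the 32-bit hash space, and the equal-width ranges here are all stand-in assumptions for illustration.

```python
import hashlib

HASH_SPACE = 2**32  # size of the illustrative hash range (an assumption)


def hash_key(partition_key):
    """Map a partition key value to a stable point in [0, HASH_SPACE)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def physical_partition(partition_key, partition_count):
    """Pick the equal-width hash range that the key's hash falls into."""
    range_width = HASH_SPACE // partition_count
    return min(hash_key(partition_key) // range_width, partition_count - 1)


users = ["John", "Dharma", "Shireesh", "Nilesh", "Sukhi", "Bob", "Milton"]
for user in users:
    print(user, "-> physical partition", physical_partition(user, 3) + 1)
```

Because the hash is deterministic, every item with the same User Id always lands in the same range, which is what keeps a logical partition together.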

hash(User Id)
….
Melvin
Karen
…
Physical
Partition 1
Physical
Partition 2
Physical
Partition n
John
Dharma
Shireesh
Nilesh
Sukhi
Bob
Milton
…
Frugal # of Partitions based on actual storage and throughput needs
(yielding scalability with low total cost of ownership)
Range 1 Range 2 Range n

What happens when partitions need to grow?

hash(User Id)
Partition X
Dharma
Shireesh
Nilesh
Sukhi
Bob
Milton
…
+
Dharma
Shireesh
…
Partition X1
Nilesh
Sukhi
…
Partition X2
Partition Ranges can be dynamically sub-divided
To seamlessly grow database as the application grows
While sedulously maintaining high availability
Range 1 Range 2 Range X1 Range X2
Range X
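
A range split like the one pictured (Partition X becoming X1 and X2) can be sketched as cutting a hash range at its midpoint and reassigning keys by hash. This is a toy model under stated assumptions: SHA-256 and the 32-bit range stand in for the service's internal hashing, and `split_range` is a hypothetical helper.

```python
import hashlib


def hash_key(partition_key):
    """Stable stand-in hash in [0, 2**32); not Cosmos DB's real function."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def split_range(low, high, keys):
    """Split the half-open hash range [low, high) at its midpoint and
    reassign each key to the sub-range containing its hash."""
    mid = (low + high) // 2
    left = [k for k in keys if low <= hash_key(k) < mid]
    right = [k for k in keys if mid <= hash_key(k) < high]
    return (low, mid, left), (mid, high, right)


partition_x = ["Dharma", "Shireesh", "Nilesh", "Sukhi", "Bob", "Milton"]
x1, x2 = split_range(0, 2**32, partition_x)
print("Partition X1:", x1[2])
print("Partition X2:", x2[2])
```

Every key ends up in exactly one of the two sub-ranges, and keys outside Range X are untouched, which is why a split does not disturb the other partitions.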

Best of All:
Partition management is completely taken care of by the system
You don’t have to lift a finger… the database takes care of you.

How do you ensure consistent reads across replicas?
- Define a consistency level
Replication within a region
- Data moves extremely fast (typically within 1 ms) between neighboring racks
Global replication
- It takes hundreds of milliseconds to move data across continents
Stronger consistency: higher latency, lower availability
Weaker consistency: lower latency, higher availability
Replication and Consistency

Consistency Level: Guarantees
Strong: Linearizability (once an operation is complete, it is visible to all). No dirty reads.
Bounded Staleness: Consistent prefix. Reads lag behind writes by at most k prefixes or a time interval t (dirty reads are possible, bounded by time and updates). Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session: Consistent prefix. Within a session: predictable consistency, high read throughput, and low latency. No dirty reads for writers (read your own writes); dirty reads are possible for other users.
Consistent Prefix: Reads never see out-of-order writes (no gaps).
Eventual: Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
Well-Defined Consistency Models
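
The read-your-own-writes guarantee of session consistency can be illustrated with a toy model in which each replica tracks the highest write it has applied and a session remembers the sequence number of its latest write. The `Replica` class and `session_read` function are hypothetical and only mimic the idea; they do not reflect how the service is implemented.

```python
class Replica:
    """A replica that has applied writes up to some sequence number."""

    def __init__(self):
        self.applied = 0  # highest write sequence number applied
        self.data = {}

    def apply(self, seq, key, value):
        self.data[key] = value
        self.applied = seq


def session_read(replicas, key, session_token):
    """Return the value from the first replica that has caught up to the
    caller's session token, so a writer always sees its own writes."""
    for replica in replicas:
        if replica.applied >= session_token:
            return replica.data.get(key)
    raise RuntimeError("no replica has caught up to this session yet")


leader, follower = Replica(), Replica()

# The writer's session records the sequence number of its latest write.
leader.apply(1, "name", "John Smith")
session_token = 1

# The follower lags (applied == 0), so the session read skips it.
print(session_read([follower, leader], "name", session_token))

# A reader with no session token may be served by the stale follower.
print(session_read([follower, leader], "name", 0))
```

The first read returns the writer's own value; the second, issued without the session token, can be served by the lagging follower and sees nothing yet, which matches the "dirty reads possible for other users" note in the table.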

Important Links
Pricing Calculator
https://azure.microsoft.com/en-us/pricing/calculator/?service=cosmos-db#cosmos-db7aed2059-b457-48cc-a0e9-6744ce81096b
SQL API Query
https://docs.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started
Azure Cosmos Emulator
https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator#controlling-the-emulator
Data Migration Tool
http://www.microsoft.com/en-us/download/details.aspx?id=46436
