Azure DocumentDb

Italian Virtual Chapter – 19.10.2016
Marco Parenzan
Microsoft MVP for Azure
Microsoft Azure Trainer @ Cloud Academy SAGL
Community Lead 1nn0va
marco.parenzan@1nn0va.it
@marco_parenzan

DocumentDb
◇Fully managed
◇Schema agnostic
◇Scalable
◇Tunable consistency levels
◇Tunable indexing policies
◇Familiar SQL syntax for querying
◇JavaScript execution

Documents

Developer Appeal
◇A document is a JSON document
◇DocumentDb is a schemaless Db
◇Resilient to iterative schema changes
◇Promotes code-first development (mapping objects to JSON)
◇Low impedance as object / JSON store; no ORM required
◇Richer query and indexing (compared to KV stores)
◇It just works
◇It’s fast
◇It’s great for Catalog Data, Preference and State, Event Store, User Generated
Content, Data Exchange

Train yourself with ViewModels
◇Implement a real contract: something to exchange between Presentation and BL/DA
◇ViewModel = a model that is functional just for presentation, not persistence
￭No more Ids
￭No more null fields
￭No more grayed/hidden fields
￭No more graphs
￭No more joins
￭No more multiple roles per entity (just one)
◇Represented very well in JSON
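
The ViewModel idea above can be sketched in a few lines. This is only an illustration under assumptions: the `customer` entity, the `cities` lookup and the `toViewModel` helper are hypothetical names, not part of the original talk.

```javascript
// Hypothetical persistence entities with ids, nulls and references.
const customer = { id: 42, name: "Ada", middleName: null, cityId: 7 };
const cities = new Map([[7, { id: 7, name: "Trieste" }]]);

// Build a presentation-only model: the join is resolved up front,
// ids are left out, and null fields are dropped.
function toViewModel(entity) {
  const vm = {
    name: entity.name,
    middleName: entity.middleName,
    city: cities.get(entity.cityId).name,
  };
  return Object.fromEntries(Object.entries(vm).filter(([, value]) => value !== null));
}

// The result serializes directly to a JSON document.
const json = JSON.stringify(toViewModel(customer));
// json is '{"name":"Ada","city":"Trieste"}'
```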

Come as you are
Data normalization
Embedding vs. Referencing

Embedding vs. referencing
[Figure labels: embed, reference]

Referencing
◇Representing one-to-many relationships.
◇Representing many-to-many relationships.
◇Related data changes frequently.
◇Referenced data could be unbounded.
◇Provides more flexibility than embedding
￭But requires more round trips to read data
◇Normalizing typically provides better write performance

Embedding
◇There are "contains" relationships between entities.
◇There are one-to-few relationships between entities.
◇There is embedded data that changes infrequently.
◇There is embedded data that won't grow without bound.
◇There is embedded data that is integral to data in a document.
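
A minimal sketch of the two modeling styles, assuming a person with two addresses. All names here (`embeddedPerson`, `referencedPerson`, `addressDocs`, `readDocument`) are hypothetical, and the `Map` is only an in-memory stand-in for a collection.

```javascript
// Embedding: the addresses live inside the person document (one read).
const embeddedPerson = {
  id: "person-1",
  name: "Ada",
  addresses: [
    { city: "Trieste", country: "IT" },
    { city: "Pordenone", country: "IT" },
  ],
};

// Referencing: the person document stores only the ids of address documents.
const referencedPerson = { id: "person-1", name: "Ada", addressIds: ["addr-1", "addr-2"] };
const addressDocs = new Map([
  ["addr-1", { id: "addr-1", city: "Trieste", country: "IT" }],
  ["addr-2", { id: "addr-2", city: "Pordenone", country: "IT" }],
]);

// In-memory stand-in for a document read; counts simulated round trips.
let roundTrips = 0;
function readDocument(store, id) {
  roundTrips += 1;
  return store.get(id);
}

// Resolving references costs one extra read per referenced document.
function loadAddresses(person) {
  return person.addressIds.map((id) => readDocument(addressDocs, id));
}

const cityNames = loadAddresses(referencedPerson).map((address) => address.city);
```

After this runs, `roundTrips` is 2 for the referenced model, while the embedded model already carries both addresses in the single document that was read.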

Resource Model
◇DocumentDb is Platform as a Service
￭No on-premises version
◇RESTful API
￭All DocumentDb elements are public and addressable by resource URI
◇Resource
￭JSON resources

Resource Model Items
Database Account > Databases > Collections > Documents

Database Account
◇Unit of Authorization
◇Unit of Consistency

Unit of Authorization
◇Master keys
￭Upon creation of a DocumentDB account, two master keys (primary and secondary)
are created. These keys enable full administrative access to all resources within the
DocumentDB account.
◇Read-only keys
￭Upon creation of a DocumentDB account, two read-only keys (primary and
secondary) are created. These keys enable read-only access to all resources within
the DocumentDB account.

Unit of Consistency
◇Query / transaction throughput (and reliability – i.e., hardware failure) depend on
replication!
￭All writes to the primary are replicated across two secondary replicas
￭All reads are distributed across three copies
￭“Scalability of throughput” – allowing different clients to read from different replicas
helps prevent bottlenecks
◇BUT replication takes time!
￭Potential scenario: some clients are reading while another is writing
￭Now, the data is stale (out-of-date), inconsistent!

Tweakable Consistency
◇Trade-off: speed (performance & availability) or consistency (data correctness)?
￭“Does every read need the MOST current data?”
￭“Or do I need every request to be handled and handled quickly?”
◇4 options …
￭Strong, Session, Bounded Staleness, Eventual
￭Default consistency is set for the entire database account
￭Per-collection setting in a future release
￭Per query (optional parameter on the CreateDocumentQuery method)

Stale data
◇ViewModel is state
￭ViewModel is state disconnected from our source of truth (the DB!)
￭ViewModel is state duplicated from the DB
￭Many users can duplicate (n-uplicate) state from the DB
◇So… which is reality?
￭You have STALE data, and stale data smells
◇What smells?
￭Copies of the data that are not the truth
◇An entity can be a lie, because it says “that will be the state”

CAP Theorem
◇Consistency:
￭All nodes should see the same data at the same time
◇Availability:
￭Node failures do not prevent survivors from continuing to operate
◇Partition-tolerance:
￭The system continues to operate despite network partitions
◇A distributed system can satisfy any two of these guarantees at the same time but
not all three

Strong
◇Client always sees completely consistent data
◇Slowest reads / writes
◇Mission critical: e.g. stock market, banking, airline reservation

Session
◇Default – even trade-off between performance & availability vs. data correctness
◇Client reads its own writes, but other clients reading this same data might see older
values

Bounded Staleness
◇Client might see old data, but it can specify a limit for how old that data can be (e.g.
2 seconds)
◇Updates happen in the order received
◇Similar to Session consistency, but speeds up reads while still preserving the order
of updates

Eventual
◇Client might see old data for as long as it takes a write to propagate to all replicas
◇High performance & availability, but a client might sometimes read out-of-date
information or see updates out of order
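
The difference between an Eventual-style read and a Session-style read can be sketched with a toy in-memory model. This is not how DocumentDB is implemented: the `ReplicaSet` class, its version counter and the manual `replicate()` step are assumptions made only to illustrate stale reads and read-your-own-writes.

```javascript
// Toy model: replica 0 is the primary, replicas 1 and 2 are secondaries.
class ReplicaSet {
  constructor() {
    this.replicas = [0, 1, 2].map(() => ({ version: 0, value: null }));
  }
  // Writes land on the primary and return the new version number.
  write(value) {
    const primary = this.replicas[0];
    primary.version += 1;
    primary.value = value;
    return primary.version;
  }
  // Copy the primary's state to every secondary (replication has finished).
  replicate() {
    const { version, value } = this.replicas[0];
    for (const replica of this.replicas.slice(1)) {
      replica.version = version;
      replica.value = value;
    }
  }
  // Eventual-style read: any replica may answer, even a stale one.
  readAny(index) {
    return this.replicas[index].value;
  }
  // Session-style read: only accept a replica that has caught up to the
  // version this client last wrote; otherwise fall back to the primary.
  readAtLeast(index, minVersion) {
    const replica = this.replicas[index];
    return (replica.version >= minVersion ? replica : this.replicas[0]).value;
  }
}

const replicaSet = new ReplicaSet();
replicaSet.write("v1");
replicaSet.replicate();
const myVersion = replicaSet.write("v2"); // not yet replicated

const staleRead = replicaSet.readAny(1);                  // "v1": the secondary lags
const sessionRead = replicaSet.readAtLeast(1, myVersion); // "v2": the writer sees its write
replicaSet.replicate();
const laterRead = replicaSet.readAny(1);                  // "v2": replication caught up
```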

Setting Consistency
◇At the database account level (see preview portal)
◇On a per-read or per-query basis (optional parameter on CreateDocumentQuery
method)

Globally Distributed
◇Azure DocumentDB gives you the
ability to cheat the speed of light!
◇Not just for disaster recovery…
DocumentDB is unreasonably highly
available
◇Replicate data across any # of regions
of your choice
◇Low-latency access to your data
around the globe
◇Dynamically configure your write and
read regions

Databases
◇Unit of Namespace

Collections

DocumentDb Performance
◇Data is saved on SSD
◇All writes to the primary are replicated across two secondary replicas
￭(Replicas are spread on different hardware in same region to protect
against failures)
◇All reads are distributed across the three copies (when and how depend on
consistency level for db account and query)

Collections
◇A unit of scale for transactions
￭for stored procedures and triggers
◇A unit of query throughput
￭(capacity units allocated uniformly across all collections)
◇A unit of replication
￭A collection is replicated three times
◇A container of JSON documents
￭JSON docs inside of a collection can vary dramatically

Collections
[Diagram: resource hierarchy]
Database Account
  Databases
    Users
      Permissions
    Collections
      Documents
        Attachments
      Stored Procedures (JS)
      Triggers (JS)
      User Defined Functions (JS)

Unit of query throughput
◇Collection-based RU Reservation
￭(Capacity units allocated uniformly across all collections)
◇Standard pricing tier with hourly billing
◇Performance levels can be adjusted
◇Each collection = 10GB of SSD
￭Limit of 100 collections (1 TB)
￭Soft limit, can be lifted as needed per account (with Support)

Performance levels

Request Units
◇Predictable Performance
◇Each DocumentDB collection has
reserved throughput in terms of request
units (RUs)
◇Normalized currency across database
operations
◇RU = f(Memory, CPU, IO)
◇RUs offer accurate accounting in face
of diverse database operations
Operation | RU Consumed
Reading a single 1 KB document | 1
Reading a single 2 KB document | 2
Query with a simple predicate for a 1 KB document | 3
Creating a single 1 KB document with 10 JSON properties (consistent indexing) | 14
Creating a single 1 KB document with 100 JSON properties (consistent indexing) | 20
Replacing a single 1 KB document | 28
Executing a stored procedure with two create documents | 30
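
As a rough illustration of how these charges add up, the sketch below estimates the RU/s a workload would need. The per-operation costs are the ones listed on this slide; the `RU_COST` keys, the `requiredRUsPerSecond` helper and the workload numbers are hypothetical.

```javascript
// RU charges per operation, taken from the slide.
const RU_COST = {
  read1KB: 1,
  read2KB: 2,
  query1KB: 3,
  create1KB10Props: 14,
  create1KB100Props: 20,
  replace1KB: 28,
};

// Sum of (operations per second x RU per operation) for a workload.
function requiredRUsPerSecond(workload) {
  return Object.entries(workload).reduce((total, [operation, perSecond]) => {
    if (!(operation in RU_COST)) throw new Error(`Unknown operation: ${operation}`);
    return total + perSecond * RU_COST[operation];
  }, 0);
}

// Hypothetical workload: 500 reads, 20 creates and 10 replaces per second.
const needed = requiredRUsPerSecond({ read1KB: 500, create1KB10Props: 20, replace1KB: 10 });
// 500 * 1 + 20 * 14 + 10 * 28 = 1060 RU/s
```

Note how the 30 writes per second cost more than the 500 reads: writes have to update the index as well as the document.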

DEMO

Partitioning

Why Partition?
◇Data Size
￭A single collection holds 10GB
◇Throughput
￭3 Performance tiers with a max of 2,500 RU/sec

Partitioning our data
[Diagram: a request served by a single collection]

Partitioning our data
[Diagram: requests routed to Partition 1 and Partition 2 (logical grouping)]
Partitioning - Hash
Evenly distribute across n number of partitions (algorithmic) …
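
A minimal sketch of client-side hash partitioning. The FNV-1a hash is chosen here only because it is short; it is an assumption for illustration, not the hash DocumentDB or its SDKs use, and the collection names are hypothetical.

```javascript
// FNV-1a 32-bit hash over the UTF-16 code units of a string.
function fnv1a(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a partition key to one of n partitions.
function hashPartition(key, partitionCount) {
  return fnv1a(key) % partitionCount;
}

// Hypothetical collections acting as partitions.
const collections = ["orders-0", "orders-1", "orders-2", "orders-3"];
const target = collections[hashPartition("customer-42", collections.length)];
```

The same key always maps to the same collection, but changing the number of partitions remaps most keys, which is the usual cost of plain modulo hashing.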

Partitioning - Range
Keep current data hot, warm historical data, scale down older data, purge / archive
[Diagram label: current period]
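
Range partitioning can be sketched as a lookup over date intervals. The half-year ranges and the collection names below are hypothetical; ISO date strings are used because they compare correctly as plain strings.

```javascript
// Hypothetical range map: each collection covers [from, to) in ISO dates.
const ranges = [
  { from: "2016-01-01", to: "2016-07-01", collection: "events-2016-h1" },
  { from: "2016-07-01", to: "2017-01-01", collection: "events-2016-h2" },
];

// Find the collection whose range contains the given date.
function rangePartition(isoDate) {
  const hit = ranges.find((range) => isoDate >= range.from && isoDate < range.to);
  if (!hit) throw new Error(`No partition covers ${isoDate}`);
  return hit.collection;
}

const currentPeriod = rangePartition("2016-10-19"); // "events-2016-h2"
```

Older ranges can then be scaled down or archived by acting on a whole collection, without touching the current period.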

Partitioning - Lookup
Home tenant / user to a specific partition. Use "master" lookup.
Tenant | Partition Id
Customer | 1
Big Customer | 2
Another | 3
Cache this shard map to avoid making the lookup the bottleneck
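
A sketch of lookup partitioning with a cached shard map. The tenant names and partition ids are the ones on the slide; `masterLookup`, `queryMaster` and `partitionFor` are hypothetical names, and the `Map` stands in for the "master" lookup collection.

```javascript
// Stand-in for the "master" lookup collection; counts how often it is queried.
const masterLookup = new Map([["Customer", 1], ["Big Customer", 2], ["Another", 3]]);
let masterReads = 0;
function queryMaster(tenant) {
  masterReads += 1;
  return masterLookup.get(tenant);
}

// Client-side cache of the shard map, filled on first use.
const shardMapCache = new Map();
function partitionFor(tenant) {
  if (!shardMapCache.has(tenant)) {
    const partitionId = queryMaster(tenant);
    if (partitionId === undefined) throw new Error(`Unknown tenant: ${tenant}`);
    shardMapCache.set(tenant, partitionId);
  }
  return shardMapCache.get(tenant);
}

const first = partitionFor("Big Customer");  // reads the master lookup
const second = partitionFor("Big Customer"); // served from the cache
```

A real cache would also need an invalidation rule for when a tenant is moved to another partition, which this sketch leaves out.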

Indexing

Index policies
◇Customize index management including storage
overhead, throughput and query
consistency
￭range, hash and spatial indexes
￭included and excluded paths
￭indexing mode; consistent or lazy
￭index precision
￭online, in-place index transformations
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        },
        {
          "kind": "Spatial",
          "dataType": "Point"
        }
      ]
    }
  ],
  "excludedPaths": []
}

Indexing Policies
Configuration | Level | Options
Automatic | Per collection | True (default) or False. Override with each document write
Indexing Mode | Per collection | Consistent or Lazy. Lazy for eventual updates/bulk ingestion
Included and excluded paths | Per path | Individual path or recursive includes (? and *)
Indexing Type | Per path | Supports Hash (default) and Range. Hash for equality, Range for range queries
Indexing Precision | Per path | Supports 3 – 7 per path. Tradeoff between storage, query RUs and write RUs

Indexing Paths
Path | Description/use case
/ | Default path for collection. Recursive and applies to whole document tree.
/"prop"/? | Serves queries like the following (with Hash or Range types respectively):
    SELECT * FROM collection c WHERE c.prop = "value"
    SELECT * FROM collection c WHERE c.prop > 5
/"prop"/* | All paths under the specified label.
/"prop"/"subprop"/ | Used during query execution to prune documents that do not have the specified path.
/"prop"/"subprop"/? | Serves queries (with Hash or Range types respectively):
    SELECT * FROM collection c WHERE c.prop.subprop = "value"
    SELECT * FROM collection c WHERE c.prop.subprop > 5

Indexing tips
◇Use lazy indexing for faster peak time ingestion rates
◇Exclude unused paths from indexing for faster writes
◇Specify range index path type for all paths used in range queries
◇Vary index precision for write vs query performance vs storage tradeoffs
◇http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
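
The first two tips can be sketched as a small helper that derives a tuned policy from the one on the "Index policies" slide. The `tunePolicy` helper and the `/rawPayload/*` path are hypothetical, and the `{ path: ... }` shape of the `excludedPaths` entries is assumed to mirror `includedPaths`; check it against the current policy schema before use.

```javascript
// Base policy from the "Index policies" slide (Range and Hash indexes only).
const basePolicy = {
  indexingMode: "consistent",
  automatic: true,
  includedPaths: [
    {
      path: "/*",
      indexes: [
        { kind: "Range", dataType: "Number", precision: -1 },
        { kind: "Hash", dataType: "String", precision: 3 },
      ],
    },
  ],
  excludedPaths: [],
};

// Return a copy of the policy with lazy indexing and/or extra excluded paths.
function tunePolicy(policy, { lazy = false, exclude = [] } = {}) {
  return {
    ...policy,
    indexingMode: lazy ? "lazy" : policy.indexingMode,
    excludedPaths: [...policy.excludedPaths, ...exclude.map((path) => ({ path }))],
  };
}

// Hypothetical ingestion-heavy collection: lazy mode, one unindexed subtree.
const tuned = tunePolicy(basePolicy, { lazy: true, exclude: ["/rawPayload/*"] });
```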

Querying

Query
◇Query over heterogeneous documents
without defining schema or managing indexes
◇Query arbitrary paths, properties and values
without specifying secondary indexes or
indexing hints
◇Execute queries with consistent results
◇Supported SQL features: predicates,
iterations (arrays), sub-queries, logical
operators, UDFs, intra-document JOINs, JSON
transforms
◇In general, more predicates result in a larger
request charge.
◇Additional predicates can help if they result
in narrowing the overall result set.
LINQ Query
from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Title == "War and Peace"
select book;

from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Author.Name == "Leo Tolstoy"
select book.Author;

SQL Query Grammar
-- Nested lookup against index
SELECT B.Author
FROM Books B
WHERE B.Author.Name = "Leo Tolstoy"

-- Transformation, Filters, Array access
SELECT { Name: B.Title, Author: B.Author.Name }
FROM Books B
WHERE B.Price > 10 AND B.Language[0] = "English"

-- Joins, User Defined Functions (UDF)
SELECT udf.CalculateRegionalTax(B.Price, "USA", "WA")
FROM Books B
JOIN L IN B.Languages
WHERE L.Language = "Russian"
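
The intra-document JOIN is the least familiar part of this grammar: it pairs each document with the elements of one of its own arrays, not with other documents. The sketch below mimics that behavior over plain objects; the sample `books` data is hypothetical and the code only approximates the query semantics.

```javascript
// Hypothetical sample documents.
const books = [
  { Title: "War and Peace", Price: 12, Languages: [{ Language: "Russian" }, { Language: "English" }] },
  { Title: "Emma", Price: 8, Languages: [{ Language: "English" }] },
];

// JOIN L IN B.Languages: one (B, L) pair per array element of each book.
const pairs = books.flatMap((B) => B.Languages.map((L) => ({ B, L })));

// WHERE L.Language = "Russian", then project the title.
const russianTitles = pairs
  .filter(({ L }) => L.Language === "Russian")
  .map(({ B }) => B.Title);
```

A book with several matching array elements would appear once per match, which is also how the JOIN behaves in a query.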

Programmability

Query with user-defined function
function region(doc)
{
    switch (doc.Location.Region)
    {
        case 0:
            return "North";
        case 1:
            return "Middle";
        case 2:
            return "South";
    }
}
◇The complexity of a query impacts the request units consumed for an operation:
◇Use of user-defined functions (UDFs)
￭SELECT or WHERE clauses
◇To take advantage of indexing, try and have at least one filter against an indexed property when leveraging a UDF in the WHERE clause.
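
The indexing tip can be illustrated locally: apply the cheap, indexable filter first so the UDF only runs on the documents that survive it. The sample documents, the `Country` property and the registered name `udf.region` in the comment are assumptions for illustration.

```javascript
// The UDF from the slide, unchanged in behavior.
function region(doc) {
  switch (doc.Location.Region) {
    case 0: return "North";
    case 1: return "Middle";
    case 2: return "South";
  }
}

// Hypothetical documents.
const docs = [
  { id: "1", Country: "IT", Location: { Region: 0 } },
  { id: "2", Country: "IT", Location: { Region: 2 } },
  { id: "3", Country: "FR", Location: { Region: 0 } },
];

// Mimics: SELECT * FROM c WHERE c.Country = "IT" AND udf.region(c) = "North"
// The equality filter narrows the set before the UDF is evaluated.
let udfCalls = 0;
const northItalian = docs
  .filter((doc) => doc.Country === "IT")
  .filter((doc) => {
    udfCalls += 1;
    return region(doc) === "North";
  });
```

Here the UDF runs twice, not three times; on a large collection the indexed filter is what keeps the request charge down.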

Executing Stored Procedures
function count(filterQuery, continuationToken) {
    var collection = getContext().getCollection();
    // Maximum number of docs to process in one batch; when reached, return
    // to the client and request a continuation.
    // Intentionally set low to demonstrate the concept. This can be much
    // higher. Try experimenting.
    // We've had it in the high thousands before seeing the stored
    // procedure time out.
    var maxResult = 25;
    // The number of documents counted.
    var result = 0;
    // tryQuery is not shown on the slide.
    tryQuery(continuationToken);
}
◇Execute “explicit” JavaScript code on a collection
Conclusions

Conclusions
◇DocumentDb is a RESTful service
◇Documents define the unit of cost, measured in Request Units
◇Database Account defines Accessibility and Consistency
◇Database is a Namespace placeholder
◇Collections are the unit of Scale

Usage: what is DocumentDb for?
◇User generated content
◇Highly specific data (what would be varbinary(MAX) in SQL)
◇Catalog data
◇Log data
◇User preferences data
◇Device sensor data
◇IoT use cases commonly share some patterns in how they ingest, process and
store data. First, these systems allow for data intake that can ingest bursts of data
from device sensors of various locales. Next, these systems process and analyze
streaming data to derive real time insights. And last but not least, most if not all data
will eventually land in a data store for ad hoc querying and offline analytics.

Any questions?
You can find me at: marco.parenzan@1nn0va.it / @marco_parenzan
Thanks!
