Italian Virtual Chapter – 19.10.2016
Azure DocumentDb
Marco Parenzan
Microsoft MVP for Azure
Microsoft Azure Trainer @ Cloud Academy SAGL
Community Lead 1nn0va
marco.parenzan@1nn0va.it
@marco_parenzan
Document Db
◇Fully managed
◇Schema agnostic
◇Scalable
◇Tunable consistency levels
◇Tunable indexing policies
◇Familiar SQL syntax for querying
◇JavaScript execution
Documents
Developer Appeal
◇Document is JSON Document
◇DocumentDb is a schemaless Db
◇Resilient to iterative schema changes
◇Promote code first development (mapping objects to json)
◇Low impedance as object / JSON store; no ORM required
◇Richer query and indexing (compared to KV stores)
◇It just works
◇It’s fast
◇It’s great for Catalog Data, Preference and State, Event Store, User Generated
Content, Data Exchange
Train yourself with ViewModels
◇Implement a real contract: something to exchange between Presentation and BL/DA
◇ViewModel = a model that is functional just for presentation, not persistence
■No more Ids
■No more null fields
■No more grayed/hidden fields
■No more graphs
■No more joins
■No multiple roles per entity (just one)
◇Well represented in JSON
Come as you are
Data normalization
Embedding vs. Referencing
Embedding vs. referencing
[Diagram: embed vs. reference]
Referencing
◇Representing one-to-many relationships.
◇Representing many-to-many relationships.
◇Related data changes frequently.
◇Referenced data could be unbounded
◇Provides more flexibility than embedding
■More round trips to read data
◇Normalizing typically provides better write performance
Embedding
◇There are "contains" relationships between entities.
◇There are one-to-few relationships between entities.
◇There is embedded data that changes infrequently.
◇There is embedded data that won't grow without bound.
◇There is embedded data that is integral to data in a document.
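The embedding and referencing guidelines can be sketched in plain JavaScript. This is only an illustration: the `person` and `post` shapes and all field names are hypothetical, and nothing here calls DocumentDB.

```javascript
// Embedded: one-to-few, bounded data that is read together with its parent.
const personEmbedded = {
  id: "person-1",
  name: "Ada",
  addresses: [
    { city: "Trieste", country: "IT" },
    { city: "London", country: "UK" }
  ]
};

// Referenced: unbounded or frequently changing data lives in its own documents.
const personReferenced = { id: "person-1", name: "Ada" };
const posts = [
  { id: "post-1", authorId: "person-1", title: "Hello" },
  { id: "post-2", authorId: "person-1", title: "World" },
  { id: "post-3", authorId: "person-2", title: "Other" }
];

// Reading embedded data needs only the parent document.
function citiesOf(person) {
  return person.addresses.map(a => a.city);
}

// Reading referenced data needs a second lookup (an extra round trip in a real store).
function postsOf(person, allPosts) {
  return allPosts.filter(p => p.authorId === person.id).map(p => p.title);
}
```

The trade-off shows up in the two read functions: the embedded read touches one document, while the referenced read has to go back to the store for the related documents.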
Resource Model
◇DocumentDb is Platform as a Service
■No on-premises version
◇RESTful API
■All DocumentDb elements are public and addressable by resource URI
◇Resource
■JSON resources
Resource Model Items
Database Account > Databases > Collections > Documents
Database Account
◇Unit of Authorization
◇Unit of Consistency
Unit of Authorization
◇Master keys
■Upon creation of a DocumentDB account, two master keys (primary and secondary)
are created. These keys enable full administrative access to all resources within the
DocumentDB account.
◇Read-only keys
■Upon creation of a DocumentDB account, two read-only keys (primary and
secondary) are created. These keys enable read-only access to all resources within
the DocumentDB account.
Unit of Consistency
◇Query / transaction throughput (and reliability – i.e., hardware failure) depend on
replication!
■All writes to the primary are replicated across two secondary replicas
■All reads are distributed across three copies
■“Scalability of throughput” – allowing different clients to read from different replicas
helps prevent bottlenecks
◇BUT replication takes time!
■Potential scenario: some clients are reading while another is writing
■Now, the data is stale (out-of-date), inconsistent!
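The stale-read scenario can be reproduced with a toy model. This is a sketch under loud assumptions: `createReplicaSet` and its methods are hypothetical, replication is triggered by hand to stand in for replication delay, and none of this reflects how DocumentDB replicates internally.

```javascript
// Toy model: one primary and two secondaries. Replication is applied only
// when replicate() is called, which stands in for replication delay.
function createReplicaSet() {
  const primary = {};
  const secondaries = [{}, {}];
  const pending = [];
  return {
    write(key, value) {
      primary[key] = value;
      pending.push([key, value]);
    },
    replicate() {
      while (pending.length > 0) {
        const [key, value] = pending.shift();
        secondaries.forEach(s => { s[key] = value; });
      }
    },
    readPrimary(key) { return primary[key]; },
    readSecondary(index, key) { return secondaries[index][key]; }
  };
}

const set = createReplicaSet();
set.write("stock", 10);
set.replicate();
set.write("stock", 9);                          // not yet replicated
const fresh = set.readPrimary("stock");         // 9
const stale = set.readSecondary(0, "stock");    // 10, the older value
set.replicate();
const caughtUp = set.readSecondary(0, "stock"); // 9
```

Between the second write and the second `replicate()` call, a client reading from a secondary sees the older value, which is exactly the window the consistency levels control.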
Tweakable Consistency
◇Trade-off: speed (performance & availability) or consistency (data correctness)?
■“Does every read need the MOST current data?”
■“Or do I need every request to be handled and handled quickly?”
◇4 options …
■Strong, Session, Bounded Staleness, Eventual
■Default consistency is set for the entire database account
■Per-collection setting planned for a future release
■Per-query override (optional parameter on the CreateDocumentQuery method)
Stale data
◇ViewModel is state
■ViewModel is state disconnected from our source of truth (the DB!)
■ViewModel is state duplicated from the DB
■Many users can each duplicate state from the DB (n copies)
◇So…which is reality?
■You have STALE data, you have a lot of smell
◇What smells?
■Copies of the data that are not the truth
◇Entity can be a lie, because it says “that will be the state”
CAP Theorem
◇Consistency:
■All nodes should see the same data at the same time
◇Availability:
■Node failures do not prevent survivors from continuing to operate
◇Partition-tolerance:
■The system continues to operate despite network partitions
◇A distributed system can satisfy any two of these guarantees at the same time but
not all three
Strong
◇Client always sees completely consistent data
◇Slowest reads / writes
◇Mission critical: e.g., stock market, banking, airline reservation
Session
◇Default – even trade-off between performance & availability vs. data correctness
◇client reads its own writes, but other clients reading this same data might see older
values
Bounded Staleness
◇Client might see old data, but it can specify a limit for how old that data can be (e.g., 2 seconds)
◇Updates happen in order received
◇similar to Session consistency, but speeds up reads while still preserving the order
of updates
Eventual
◇client might see old data for as long as it takes a write to propagate to all replicas
◇High performance & availability, but a client might sometimes read out-of-date
information or see updates out of order
Setting Consistency
◇At the database level (see preview portal)
◇On a per-read or per-query basis (optional parameter on CreateDocumentQuery
method)
Globally Distributed
◇Azure DocumentDB gives you the ability to cheat the speed of light!
◇Not just for disaster recovery: DocumentDB is unreasonably highly available
◇Replicate data across any number of regions of your choice
◇Low-latency access to your data
around the globe
◇Dynamically configure your write and
read regions
Databases
◇Unit of Namespace
Collections
DocumentDb Performance
◇Data is saved on SSD
◇All writes to the primary are replicated across two secondary replicas
■(Replicas are spread on different hardware in same region to protect
against failures)
◇All reads are distributed across the three copies (when and how depend on
consistency level for db account and query)
Collections
◇A unit of scale for transaction
■for stored procedures and triggers
◇A unit of query throughput
■Capacity units allocated uniformly across all collections
◇A unit of replication
■A collection is replicated three times
◇A container of JSON documents
■JSON docs inside of a collection can vary dramatically
Collections
Database Account
  Databases
    Users
      Permissions
    Collections
      Documents
        Attachments
      Stored Procedures (JS)
      Triggers (JS)
      User Defined Functions (JS)
Unit of query throughput
◇Collection-based RU Reservation
■Capacity units allocated uniformly across all collections
◇Standard pricing tier with hourly billing
◇Performance levels can be adjusted
◇Each collection = 10GB of SSD
■Limit of 100 collections (1 TB)
■Soft limit, can be lifted as needed per account (with Support)
Performance levels
Request Units
◇Predictable Performance
◇Each DocumentDB collection has
reserved throughput in terms of request
units (RUs)
◇Normalized currency across database
operations
◇RU = f(Memory, CPU, IO)
◇RUs offer accurate accounting in face
of diverse database operations
Operation | RU Consumed
Reading a single 1 KB document | 1
Reading a single 2 KB document | 2
Query with a simple predicate for a 1 KB document | 3
Creating a single 1 KB document with 10 JSON properties (consistent indexing) | 14
Creating a single 1 KB document with 100 JSON properties (consistent indexing) | 20
Replacing a single 1 KB document | 28
Executing a stored procedure with two create documents | 30
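Because RUs are a normalized currency, the table above can be turned into a rough throughput estimate. The sketch below uses the RU figures from the table; the workload numbers and the `requiredRuPerSecond` helper are hypothetical, and real charges vary with document size and indexing policy.

```javascript
// RU costs copied from the table above (1 KB documents, consistent indexing).
const RU_COST = { read: 1, query: 3, create: 14, replace: 28 };

// Sum of (operations per second x RU per operation) for a workload.
function requiredRuPerSecond(opsPerSecond) {
  return Object.keys(opsPerSecond).reduce(
    (total, op) => total + opsPerSecond[op] * RU_COST[op], 0);
}

// Hypothetical workload: 500 reads, 50 queries, 20 creates, 10 replaces per second.
const workload = { read: 500, query: 50, create: 20, replace: 10 };
const needed = requiredRuPerSecond(workload);
// 500*1 + 50*3 + 20*14 + 10*28 = 500 + 150 + 280 + 280
console.log(needed); // 1210
```

Writes dominate the budget here: 30 writes per second cost more RUs than 500 reads per second.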
DEMO
Partitioning
Why Partition?
◇Data Size
A single collection holds 10GB
◇Throughput
3 Performance tiers with a max of 2,500 RU/sec
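The two limits translate into a simple sizing rule: the number of collections is driven by whichever limit is hit first. A minimal sketch, assuming the 10 GB and 2,500 RU/sec figures quoted on this slide; `collectionsNeeded` is a hypothetical helper.

```javascript
const GB_PER_COLLECTION = 10;       // storage limit quoted on the slide
const MAX_RU_PER_COLLECTION = 2500; // throughput limit quoted on the slide

// Returns how many collections a workload needs, taking the larger of the
// storage-driven and throughput-driven counts.
function collectionsNeeded(dataGb, ruPerSecond) {
  const bySize = Math.ceil(dataGb / GB_PER_COLLECTION);
  const byThroughput = Math.ceil(ruPerSecond / MAX_RU_PER_COLLECTION);
  return Math.max(1, bySize, byThroughput);
}
```

For example, 35 GB at 4,000 RU/sec is storage-bound (4 collections), while 8 GB at 6,000 RU/sec is throughput-bound (3 collections).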
Partitioning our data
[Diagram: all requests go to a single collection]
Partitioning our data
[Diagram: requests are routed to Partition 1 or Partition 2 by logical grouping]
Partitioning - Hash
Evenly distribute data across n partitions (algorithmic)
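Hash partitioning can be sketched as "hash the key, take the remainder". The hash below is a simple djb2-style function chosen for illustration; it is an assumption, not the hash DocumentDB or its SDKs use, and `partitionFor` and `distribution` are hypothetical helpers.

```javascript
// Simple deterministic string hash (djb2 variant); for illustration only.
function hashString(key) {
  let hash = 5381;
  for (let i = 0; i < key.length; i++) {
    hash = ((hash * 33) ^ key.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Maps a partition key to a partition index in [0, partitionCount).
function partitionFor(key, partitionCount) {
  return hashString(key) % partitionCount;
}

// Counts how many keys land on each partition.
function distribution(keys, partitionCount) {
  const counts = new Array(partitionCount).fill(0);
  keys.forEach(k => { counts[partitionFor(k, partitionCount)]++; });
  return counts;
}
```

The same key always maps to the same partition, so reads can be routed without a lookup table. The cost is that changing `partitionCount` remaps most keys.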
Partitioning - Range
Keep current-period data hot, keep historical data warm, scale down older data, purge / archive
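Range partitioning by time period can be sketched as one collection per month. The naming scheme, the `readings` prefix and the age thresholds in `tierFor` are all hypothetical; the slide only names the hot / warm / scaled-down / archived stages.

```javascript
// Maps a timestamp to a per-month collection name, e.g. "readings-2016-10".
function rangePartitionFor(isoDate, prefix) {
  const d = new Date(isoDate);
  const year = d.getUTCFullYear();
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  return prefix + "-" + year + "-" + month;
}

// Classifies a partition by its age in months (thresholds are assumptions).
function tierFor(monthsOld) {
  if (monthsOld === 0) return "hot";
  if (monthsOld <= 3) return "warm";
  if (monthsOld <= 12) return "scaled-down";
  return "archive";
}
```

With this layout, old periods can be scaled down or dropped a whole collection at a time, without touching current data.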
Partitioning - Lookup
Home tenant / user to a specific partition. Use "master" lookup.
Tenant | Partition Id
Customer | 1
Big Customer | 2
Another | 3
Cache this shard map to avoid making the lookup the bottleneck.
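The cached shard map can be sketched as follows. The tenant names and partition ids come from the slide; `lookupRemote` is a hypothetical stand-in for reading the "master" lookup from a remote store, and the counter exists only to show that the cache avoids repeated lookups.

```javascript
// The "master" lookup from the slide; in practice this would live in a remote store.
const shardMapStore = { "Customer": 1, "Big Customer": 2, "Another": 3 };
let remoteLookups = 0;

// Stand-in for a remote read of the shard map.
function lookupRemote(tenant) {
  remoteLookups++;
  return shardMapStore[tenant];
}

const shardCache = new Map();

// Resolves a tenant to its partition id, hitting the remote map once per tenant.
function partitionForTenant(tenant) {
  if (!shardCache.has(tenant)) {
    const partitionId = lookupRemote(tenant);
    if (partitionId === undefined) {
      throw new Error("Unknown tenant: " + tenant);
    }
    shardCache.set(tenant, partitionId);
  }
  return shardCache.get(tenant);
}
```

A real cache would also need an invalidation strategy for when a tenant is moved to another partition; that is left out here.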
Indexing
Index policies
◇Customize index management, including storage overhead, throughput and query
consistency
■range, hash and spatial indexes
■included and excluded paths
■indexing mode; consistent or lazy
■index precision
■online, in-place index transformations
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        },
        {
          "kind": "Spatial",
          "dataType": "Point"
        }
      ]
    }
  ],
  "excludedPaths": []
}
Indexing Policies
Configuration | Level | Options
Automatic | Per collection | True (default) or False. Override with each document write
Indexing Mode | Per collection | Consistent or Lazy. Lazy for eventual updates/bulk ingestion
Included and excluded paths | Per path | Individual path or recursive includes (? and *)
Indexing Type | Per path | Supports Hash (default) and Range. Hash for equality, Range for range queries
Indexing Precision | Per path | Supports 3 – 7 per path. Trade-off between storage, query RUs and write RUs
Indexing Paths
Path | Description/use case
/ | Default path for collection. Recursive and applies to whole document tree.
/"prop"/? | Serves queries like the following (with Hash or Range types respectively):
  SELECT * FROM collection c WHERE c.prop = "value"
  SELECT * FROM collection c WHERE c.prop > 5
/"prop"/* | All paths under the specified label.
/"prop"/"subprop"/ | Used during query execution to prune documents that do not have the specified path.
/"prop"/"subprop"/? | Serves queries (with Hash or Range types respectively):
  SELECT * FROM collection c WHERE c.prop.subprop = "value"
  SELECT * FROM collection c WHERE c.prop.subprop > 5
Indexing tips
◇Use lazy indexing for faster peak time ingestion rates
◇Exclude unused paths from indexing for faster writes
◇Specify range index path type for all paths used in range queries
◇Vary index precision for write vs query performance vs storage tradeoffs
◇http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
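The first three tips can be combined into a single policy object. The sketch below builds one in the same shape as the JSON policy shown earlier. It is an assumption-heavy illustration: `buildIndexingPolicy` is a hypothetical helper, the paths `/rawPayload/*` and `/timestamp/?` are made up, and the exact path syntax and the `{ path }` shape of `excludedPaths` entries should be checked against the indexing policy documentation for the service version in use.

```javascript
// Builds an indexing policy object; only plain data, no SDK calls.
function buildIndexingPolicy(options) {
  const policy = {
    indexingMode: options.lazy ? "lazy" : "consistent",
    automatic: true,
    includedPaths: [
      {
        path: "/*",
        indexes: [
          { kind: "Hash", dataType: "String", precision: 3 },
          { kind: "Range", dataType: "Number", precision: -1 }
        ]
      }
    ],
    // Paths that are never queried are excluded to make writes cheaper.
    excludedPaths: (options.excluded || []).map(p => ({ path: p }))
  };
  // Paths used in range queries get Range indexes for strings and numbers.
  (options.rangePaths || []).forEach(p => {
    policy.includedPaths.push({
      path: p,
      indexes: [
        { kind: "Range", dataType: "String", precision: -1 },
        { kind: "Range", dataType: "Number", precision: -1 }
      ]
    });
  });
  return policy;
}

const policy = buildIndexingPolicy({
  lazy: true,                      // favors ingestion rate over index freshness
  excluded: ["/rawPayload/*"],     // hypothetical unqueried path
  rangePaths: ["/timestamp/?"]     // hypothetical path used in range queries
});
```

Serializing `policy` with `JSON.stringify(policy, null, 2)` gives a document that can be compared side by side with the policy on the slide.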
Querying
Query
◇Query over heterogeneous documents
without defining schema or managing indexes
◇Query arbitrary paths, properties and values
without specifying secondary indexes or
indexing hints
◇Execute queries with consistent results
◇Supported SQL features: predicates, iterations (arrays), sub-queries, logical
operators, UDFs, intra-document JOINs, JSON transforms
◇In general, more predicates result in a larger
request charge.
◇Additional predicates can help if they result
in narrowing the overall result set.
LINQ Query
from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Title == "War and Peace"
select book;

from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Author.Name == "Leo Tolstoy"
select book.Author;

SQL Query Grammar
-- Nested lookup against index
SELECT B.Author
FROM Books B
WHERE B.Author.Name = "Leo Tolstoy"

-- Transformation, Filters, Array access
SELECT { Name: B.Title, Author: B.Author.Name }
FROM Books B
WHERE B.Price > 10 AND B.Language[0] = "English"

-- Joins, User Defined Functions (UDF)
SELECT udf.CalculateRegionalTax(B.Price, "USA", "WA")
FROM Books B
JOIN L IN B.Languages
WHERE L.Language = "Russian"
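The intra-document JOIN is the least familiar of these features: `JOIN L IN B.Languages` pairs each document with the elements of its own array, never with other documents. A minimal local emulation in plain JavaScript, using hypothetical book documents and a hypothetical `joinLanguages` helper:

```javascript
// Hypothetical book documents; only the fields used below matter.
const books = [
  { Title: "War and Peace", Languages: [{ Language: "Russian" }, { Language: "English" }] },
  { Title: "Emma", Languages: [{ Language: "English" }] }
];

// JOIN L IN B.Languages: pair each book with each element of its own array.
function joinLanguages(docs) {
  const pairs = [];
  docs.forEach(B => {
    (B.Languages || []).forEach(L => pairs.push({ B, L }));
  });
  return pairs;
}

// WHERE L.Language = "Russian", then project the title.
const russianTitles = joinLanguages(books)
  .filter(({ L }) => L.Language === "Russian")
  .map(({ B }) => B.Title);
```

A book with no `Languages` array produces no pairs at all, so it drops out of the result before the filter even runs.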
DEMO
Programmability
Query with user-defined function
◇The complexity of a query impacts the request units consumed for an operation:
◇Use of user-defined functions (UDFs)
■SELECT or WHERE clauses
◇To take advantage of indexing, try and have at least one filter against an indexed
property when leveraging a UDF in the WHERE clause.
function region(doc)
{
    switch (doc.Location.Region)
    {
        case 0:
            return "North";
        case 1:
            return "Middle";
        case 2:
            return "South";
    }
}
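Since a UDF is ordinary JavaScript, it can be exercised locally before it is registered. The sketch below copies the `region` function from the slide and runs it over hypothetical sample documents, emulating a filter such as `WHERE udf.region(c) = "North"`; the sample data and the local filtering are assumptions made for illustration.

```javascript
// The UDF from the slide, unchanged in behavior.
function region(doc) {
  switch (doc.Location.Region) {
    case 0: return "North";
    case 1: return "Middle";
    case 2: return "South";
  }
}

// Hypothetical sensor readings.
const readings = [
  { id: "a", Location: { Region: 0 } },
  { id: "b", Location: { Region: 2 } },
  { id: "c", Location: { Region: 7 } }
];

// Local emulation of filtering on the UDF result.
const north = readings.filter(d => region(d) === "North").map(d => d.id);

// No case matches region 7, so the function returns undefined.
const unknown = region(readings[2]);
```

The last line highlights a gap worth deciding on before deployment: region codes outside 0 to 2 silently yield `undefined`.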
Executing Stored Procedures
◇Execute “explicit” JavaScript code on collection
function count(filterQuery, continuationToken) {
    var collection = getContext().getCollection();
    // Maximum number of docs to process in one batch; when reached, return to the
    // client and request a continuation.
    // Intentionally set low to demonstrate the concept. This can be much higher.
    // Try experimenting. We've had it into the high thousands before seeing the
    // stored procedure time out.
    var maxResult = 25;
    // The number of documents counted.
    var result = 0;
    // tryQuery is not shown on the slide.
    tryQuery(continuationToken);
}
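The slide elides `tryQuery`, so the continuation pattern described in the comments is easier to follow with a runnable sketch. Everything below is an assumption made for illustration: `makeMockContext` is a hypothetical in-memory stand-in for the server-side runtime, the `filterQuery` parameter is dropped for brevity, and the body of `tryQuery` is one plausible shape rather than the author's code. Only the names `getContext`, `getCollection`, `getSelfLink`, `readDocuments` and `getResponse().setBody` mirror the server-side API.

```javascript
// Hypothetical in-memory stand-in for the server-side context; not the real runtime.
function makeMockContext(docs) {
  let responseBody;
  const collection = {
    getSelfLink() { return "dbs/demo/colls/demo"; },
    // Returns one page of documents starting at the continuation offset.
    readDocuments(link, options, callback) {
      const start = options.continuation ? Number(options.continuation) : 0;
      const page = docs.slice(start, start + options.pageSize);
      const next = start + page.length < docs.length ? String(start + page.length) : null;
      callback(null, page, { continuation: next });
      return true; // request accepted
    }
  };
  return {
    getCollection() { return collection; },
    getResponse() {
      return {
        setBody(body) { responseBody = body; },
        getBody() { return responseBody; }
      };
    }
  };
}

let mockContext; // assigned by countAll before each run
function getContext() { return mockContext; }

// Counts up to maxResult documents per call and hands back a continuation token.
function count(continuationToken) {
  const collection = getContext().getCollection();
  const maxResult = 25;
  let result = 0;

  tryQuery(continuationToken);

  function tryQuery(token) {
    const options = { continuation: token, pageSize: maxResult };
    // Stop when the batch is full or the request was not accepted.
    if (result >= maxResult ||
        !collection.readDocuments(collection.getSelfLink(), options, onRead)) {
      setBody(token);
    }
  }

  function onRead(err, page, responseOptions) {
    if (err) throw new Error("Error while reading documents: " + err);
    result += page.length;
    if (responseOptions.continuation) {
      tryQuery(responseOptions.continuation);
    } else {
      setBody(null); // no more pages
    }
  }

  function setBody(token) {
    getContext().getResponse().setBody({ count: result, continuationToken: token });
  }
}

// Client side: call the procedure repeatedly until no token comes back.
function countAll(docs) {
  mockContext = makeMockContext(docs);
  let total = 0;
  let token = null;
  do {
    count(token);
    const body = mockContext.getResponse().getBody();
    total += body.count;
    token = body.continuationToken;
  } while (token);
  return total;
}
```

The key design point is that the procedure never tries to finish in one execution: it does a bounded amount of work, returns a partial count plus a token, and relies on the client loop in `countAll` to resume.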
Triggers
◇Execute “implicit” JavaScript code on CRUD operations (Insert, Update, Delete) on
collections
function normalize() {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var doc = getContext().getRequest().getBody();
    var newDoc = {
        "Sensor": {
            "Id": doc.sensorId,
            "Class": 0
        },
        "Degree": {
            "Value": doc.degreeValue,
            "Type": 0
        },
        "Location": {
            "Name": doc.locationName,
            "Region": doc.locationRegion,
            "Longitude": doc.locationLong,
            "Latitude": doc.locationLat
        },
        "id": doc.id
    };
    // Update the request -- this is what is going to be inserted.
    getContext().getRequest().setBody(newDoc);
}
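The `normalize` pre-trigger can be tried locally against a mocked request. In this sketch `makeTriggerContext` is a hypothetical stand-in for the server-side context, the sample reading is invented, and `normalize` is a trimmed copy of the trigger on the slide (the unused collection variables are left out).

```javascript
// Hypothetical stand-in for the context a pre-trigger sees; not the real runtime.
function makeTriggerContext(body) {
  const request = {
    getBody() { return body; },
    setBody(newBody) { body = newBody; }
  };
  const collection = { getSelfLink() { return "dbs/demo/colls/demo"; } };
  return {
    getRequest() { return request; },
    getCollection() { return collection; }
  };
}

let triggerContext;
function getContext() { return triggerContext; }

// Trimmed copy of the slide's trigger: reshapes a flat reading into nested objects.
function normalize() {
  const doc = getContext().getRequest().getBody();
  const newDoc = {
    Sensor: { Id: doc.sensorId, Class: 0 },
    Degree: { Value: doc.degreeValue, Type: 0 },
    Location: {
      Name: doc.locationName,
      Region: doc.locationRegion,
      Longitude: doc.locationLong,
      Latitude: doc.locationLat
    },
    id: doc.id
  };
  // Replace the request body: this is what would be inserted.
  getContext().getRequest().setBody(newDoc);
}

// Hypothetical flat reading as a device might send it.
triggerContext = makeTriggerContext({
  id: "r1", sensorId: "s-42", degreeValue: 21.5,
  locationName: "Pordenone", locationRegion: 0,
  locationLong: 12.66, locationLat: 45.96
});
normalize();
const stored = getContext().getRequest().getBody();
```

After the trigger runs, `stored` holds the nested document and the flat fields such as `sensorId` are gone, which is the "implicit" rewrite the slide describes.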
Conclusions
Conclusions
◇DocumentDb is a RESTful service
◇Documents define the unit of cost, measured in Request Units
◇Database Account defines accessibility and consistency
◇Database is a namespace placeholder
◇Collections are the unit of scale
Usage: what is DocumentDb for?
◇User generated content
◇Many specific data (varbinary(MAX) in SQL)
◇Catalog data
◇Log data
◇User preferences data
◇Device sensor data
◇IoT use cases commonly share some patterns in how they ingest, process and
store data. First, these systems must take in bursts of data from device sensors in
various locations. Next, they process and analyze streaming data to derive real-time
insights. Finally, most if not all data eventually lands in a data store for ad hoc
querying and offline analytics.
Any questions?
You can find me at: marco.parenzan@1nn0va.it / @marco_parenzan
Thanks!


Editor's Notes

  • #7  instead of taking the business subject / domain entity and breaking it up into multiple relational structures store the business subject in the minimal number of documents.
  • #8 Add diagram showing the differences