This document provides an overview of and introduction to Cosmos DB. It covers what Cosmos DB is, its data models, APIs, partitioning, and global distribution, and explains why Cosmos DB was created to address limitations of traditional databases. Key topics include throughput and consistency levels, indexing, backups, failovers, and working with Cosmos DB as a developer or database administrator. The document also discusses migration tools, limitations, and integrations with Power BI and geospatial data.
CosmosDB for DBAs & Developers
1. Cosmos DB for DBAs & DEVs
Niko Neugebauer – Consultant @ OH22
2. Speaker
Niko Neugebauer – Consultant, OH22
Professional Focus
• Data Platform (especially from Microsoft)
• Columnstore blogger (110+ posts) at http://www.nikoport.com/columnstore
• Creator of CISL – Columnstore Indexes Script Library (https://github.com/NikoNeugebauer/CSIL)
Community
• Led the first international SQLSaturday
• PASS User Group Leader
• TUGA Non-Profit Association Leader
Niko speaks regularly at events such as PASS Summit, SQLRally, SQLBits, and SQLSaturday events around the world.
/in/webcaravela/ • @NikoNeugebauer
4. CAP Theorem – old wisdom: pick just 2!
• Consistency
• Availability
• Partition tolerance
9. What is CosmosDB
• Azure Cosmos DB is Microsoft's globally distributed, multi-model database.
• With the click of a button, Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure's geographic regions.
• It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), something no other database service can offer.
11. Data Models in CosmosDB
• The database engine operates on an atom-record-sequence (ARS) based type system; all data models are translated to ARS.
• APIs and wire protocols are supported via extensible modules.
• Currently supported data models: Documents, Graphs, Key-Value, Column-Family
12. API (30-11-2017)
• DocumentDB API
• SQL-like API
• MongoDB API
• Table API
• Graph API (TinkerPop, Gremlin/Groovy)
• Cassandra API
• Spark
• Geospatial support
• more will be coming!
13. A word on Table API vs Azure Table Storage comparison
• Latency: Table Storage – fast; Cosmos DB Table API – single-digit millisecond latency.
• Throughput: Table Storage – variable, scalable up to 20,000 operations/second; Cosmos DB Table API – highly scalable with dedicated reserved throughput per table, up to 10 million operations/sec.
• Global Distribution: Table Storage – single region; Cosmos DB Table API – turnkey global distribution.
• Indexing: Table Storage – only a primary index on PartitionKey and RowKey; Cosmos DB Table API – automatic and complete indexing on all properties, with no index management.
• Query: Table Storage – query execution uses the index for the primary key and scans otherwise; Cosmos DB Table API – queries can take advantage of automatic indexing on properties for fast query times.
• Consistency: Table Storage – strong in the primary region, eventual in secondary regions; Cosmos DB Table API – 5 well-defined consistency levels.
18. Partitioning
• Implemented on the tenant level (Collection, Graph, Table)
• A resource partition is a resource-governed primitive, which is limited to a subset of keys.
• Capable of doing splits, merges, etc. of the partitions
19. Partitioning Best Practices
• Select a PartitionKey that gives the best data distribution
• Use a location-aware partition key for the best access locality
• Select a PartitionKey which can serve as a transaction scope
• Don't use timestamps as the partition key for write-heavy workloads; use time ranges (hour, day, week, month, year) for even data distribution (see the sketch below).
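As a minimal sketch of these guidelines with the Python SDK (azure-cosmos), where the account endpoint, key, and the /deviceId partition key path are illustrative assumptions rather than values from the deck:

    # pip install azure-cosmos
    from azure.cosmos import CosmosClient, PartitionKey

    # Endpoint and key are placeholders -- substitute your own account values.
    client = CosmosClient("https://myaccount.documents.azure.com:443/",
                          credential="<account-key>")
    database = client.create_database_if_not_exists(id="telemetry")

    # A high-cardinality key such as /deviceId spreads writes evenly across
    # partitions; a raw timestamp would funnel all current writes into one.
    container = database.create_container_if_not_exists(
        id="readings",
        partition_key=PartitionKey(path="/deviceId"),
        offer_throughput=400,  # RU/sec
    )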
21. Why create CosmosDB?
• Traditional relational databases were designed in the 70s-80s
• Data is growing (petabytes, exabytes, etc.)
• Think about Internet scale and distributed systems
• Provide API choices
Think about:
• Availability
• Performance
• Costs
22. CosmosDB: the focus on performance

Percentile   Reads (1KB)   Indexed Writes (1KB)
50th         < 2ms         < 6ms
99th         < 10ms        < 15ms

▪ Globally distributed, with reads and writes served from/to the local region
▪ Write-optimised, latch-free engine designed for SSDs
▪ Synchronous/asynchronous automatic indexing
23. Azure Cosmos DB
• Azure Cosmos DB is fully schema agnostic.
• Uses JSON to describe the supported data models
• Automatic indexing of all ingested content
• Resource Governed, write-optimised engine
• Online Index operations
24. Core pieces of CosmosDB Architecture
• Global distribution
• Resource Governance
• Schema-agnostic service
25. Consistency Levels (and there are 5 of them):
• You pick a stronger consistency level like strong or bounded staleness for your account because a critical path in your e-commerce/LOB application needs the guarantee.
• But for some less-critical operations (like a reporting dashboard query), you would choose a weaker consistency level because it consumes only half the throughput.
• The current offering of consistency levels is: Strong / Bounded Staleness / Session / Consistent Prefix / Eventual
27. Default Consistency Levels:
• Strong – linearizable. Reads are guaranteed to return the most recent version of an item.
• Bounded Staleness – consistent prefix; reads lag behind writes by at most k prefixes or a time interval t.
• Session – consistent prefix; monotonic reads, monotonic writes, read-your-writes, and write-follows-reads within your geographical location.
• Consistent Prefix – updates returned are some prefix of all the updates, with no gaps: if sequential writes were applied, the earlier ones are always visible before the later ones.
• Eventual – out-of-order reads are possible.
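The consistency level can also be relaxed per client below the account default. A minimal sketch with the Python SDK, where the Eventual choice for a reporting client is an illustrative assumption:

    from azure.cosmos import CosmosClient

    # A client dedicated to the reporting dashboard: weaker consistency,
    # roughly half the read cost compared to strong consistency.
    reporting_client = CosmosClient(
        "https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
        consistency_level="Eventual",  # may only weaken the account default
    )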
28. Indexing & Consistency Levels:
• Consistent indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: choose from strong, bounded staleness, session, or eventual.
• Lazy indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.
• None indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.
29. Throughput
• RU – Request Unit
• A blend of % memory / % CPU / % IOPS, just like for Azure SQL DB
• READ / INSERT / UPSERT / DELETE / QUERY operations all consume RUs
• QUERY cost = scans + index lookups + query complexity + instruction cost
• Everything is calculated by Azure ML
30. Throughput
• RU/sec – Request Units per second
• 400 RU/sec – 10,000 RU/sec (collections)
• 2,500 RU/sec – unlimited(?) RU/sec (partitioned collections)
• Minimum increase/decrease step is 100 RU/sec
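Every response reports the RUs it consumed, so these numbers can be measured rather than guessed. A sketch continuing the container from the earlier example (the header name x-ms-request-charge is real; the item itself is made up):

    # Write one document and inspect its RU charge (continuing the earlier sketch).
    item = {"id": "r-001", "deviceId": "dev-42", "temperature": 21.5}
    container.create_item(body=item)

    # The service returns the consumed RUs in the x-ms-request-charge header.
    charge = container.client_connection.last_response_headers["x-ms-request-charge"]
    print(f"Insert consumed {charge} RUs")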
31. Scaling Cosmos DB Up & Out
• Scale Up – increase the number of RUs
• Scale Out – increase the number of partitions for your collections/graphs/tables
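Scaling up is just a throughput change on the container's offer; a minimal sketch, again continuing the earlier Python example:

    # Read the current provisioned throughput, then scale up in 100 RU/sec steps.
    throughput = container.get_throughput()
    print(f"Current: {throughput.offer_throughput} RU/sec")
    container.replace_throughput(throughput.offer_throughput + 400)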
43. Azure CosmosDB Data Migration Tool
• Allows you to migrate your data into CosmosDB
• Supports a range of sources
• Does not support GraphDB ... yet
48. Azure Cosmos DB Emulator
Software requirements:
• Windows Server 2012 R2, Windows Server 2016, or Windows 10
Minimum Hardware requirements:
• 2 GB RAM
• 10 GB available hard disk space
51. Indexing Policy Modes
• Consistent – follows the same consistency level as specified for point reads (i.e. strong, bounded staleness, session or eventual). The index is updated synchronously as part of the document update. The workload target is "write quickly, query immediately".
• Lazy – to allow maximum document ingestion throughput, an Azure Cosmos DB collection can be configured with lazy indexing, meaning queries are eventually consistent. The index is updated asynchronously when the collection is quiescent.
• None – a collection marked with index mode "None" has no index associated with it. This is commonly used when Azure Cosmos DB is utilized as a key-value store and documents are accessed only by their ID property.
54. Indexing Paths
• / – the default path for the collection; recursive
• /name/? – hash or range indexes for predicates and sorts on the value at "name"
• /name/* – index path for all paths under the specified label (multiple levels down)
• /name/[]/prop/? – index path required to serve iteration and JOIN queries against arrays of objects like [{prop: "a"}, {prop: "b"}]
55. Indexes Types, Kinds & Precisions
DataTypes:
• String
• Number
• Point
• Polygon
• LineString
56. Indexes Types, Kinds & Precisions
Index Types:
• Hash – hash indexes, think Hekaton (hash indexes). Supports equality and JOIN queries; for most queries the default precision of 3 bytes is sufficient. DataType can be String or Number.
• Range – range indexes, think Hekaton (Bw-Tree). Supports equality & range queries (<, >, <=, >=, !=) and ORDER BY queries. DataType can be String or Number.
• Spatial – spatial indexes for Points, Polygons & LineStrings. Supports efficient spatial queries (within & distance).
57. Indexes Precision
• Lets you trade off between index storage overhead and query performance.
• For numbers, Microsoft recommends the default precision of -1 ("maximum"). Note that numbers are 8 bytes in JSON.
• Picking smaller precisions (1-7) means collisions and hence more RU consumption.
• For string ranges, which can be of arbitrary length, the index precision can impact the performance of range queries as well as storage. The precision can be specified between 1 and 100.
• Important: if you need sorting of the results (ORDER BY), you must specify a precision of 100. A combined example follows below.
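Putting paths, kinds, and precisions together: a hypothetical indexing policy in the legacy format this deck describes, passed at container creation with the Python SDK. The paths and precisions are illustrative assumptions, and newer API versions have since simplified this format:

    # Continuing the earlier sketch (client, database, PartitionKey in scope).
    indexing_policy = {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [
            {   # equality/JOIN lookups on /name
                "path": "/name/?",
                "indexes": [{"kind": "Hash", "dataType": "String", "precision": 3}],
            },
            {   # range queries and ORDER BY on numbers, full precision
                "path": "/*",
                "indexes": [{"kind": "Range", "dataType": "Number", "precision": -1}],
            },
        ],
        # A write-only blob we never query: excluding it saves RUs and storage.
        "excludedPaths": [{"path": "/rawPayload/*"}],
    }

    container = database.create_container_if_not_exists(
        id="readings",
        partition_key=PartitionKey(path="/deviceId"),
        indexing_policy=indexing_policy,
    )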
59. Indexing Policy Changes – what for?
• When importing bulk data using the lazy indexing mode for faster writes, then switching to consistent indexing for regular operation (see the sketch below).
• When reducing the throughput cost of writes, as well as the storage space used, by hand-selecting the properties to be indexed and changing them over time, or by varying the index precision of individual properties.
• When using new indexing features on your existing DocumentDB collections, such as ORDER BY and string range queries, which require the newly introduced string Range index kind.
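A sketch of the first scenario (lazy during bulk load, consistent afterwards), assuming the database, container, and policy objects from the earlier sketches; replace_container is the v4 Python SDK call for changing a container's policy:

    # Bulk import finished: switch the collection back to consistent indexing.
    indexing_policy["indexingMode"] = "consistent"
    database.replace_container(
        container,
        partition_key=PartitionKey(path="/deviceId"),
        indexing_policy=indexing_policy,
    )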
63. Backup for DBAs:
• Every 4 hours (approx.) a backup is taken (to Azure Blob Storage)
• At least 2 backups are stored at all times
• If you lose your data, you need to contact Azure Support within 8 hours
• Backup retention: 30 days for deleted partitions/databases
• If you want to maintain your own snapshots, you can use the export-to-JSON option in the Azure Cosmos DB Data Migration Tool to schedule additional backups.
64. Backup for DBAs – read carefully:
• As soon as corruption is detected, delete the corrupted container (collection/graph/table) so that backups are protected from being overwritten with corrupted data.
Source: https://docs.microsoft.com/en-us/azure/cosmos-db/online-backup-and-restore
65. Backup for DBAs – the alternative:
• Extract JSON files of your databases/collections/graphs with the help of the Azure Cosmos DB Data Migration Tool
71. Manual Failover Scenarios:
• Follow-the-clock model: if your applications have predictable traffic patterns based on the time of day, you can periodically change the write status to the most active geographic region for that time of day.
• Service update: certain globally distributed application deployments may involve rerouting traffic to a different region via Traffic Manager during a planned service update. Such deployments can use manual failover to keep the write status in the region that will have active traffic during the service update window.
• Business Continuity and Disaster Recovery (BCDR) and High Availability and Disaster Recovery (HADR) drills: most enterprise applications include business continuity tests as part of their development and release process. BCDR and HADR testing is often an important step in compliance certifications and in guaranteeing service availability in the case of regional outages. You can test the BCDR readiness of your applications that use Cosmos DB for storage by triggering a manual failover of your Cosmos DB account and/or adding and removing a region dynamically.
72. Global Distribution aka Geo-Replication aka Regional Failover
• Configuration
• First, deploy your application in multiple regions
• To ensure low-latency access from every region where your application is deployed, configure the corresponding preferred-regions list for each region via one of the supported SDKs.
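With the Python SDK the preferred-regions list is a client option; a minimal sketch where the region names are assumptions for an app instance deployed in Western Europe:

    from azure.cosmos import CosmosClient

    # Order matters: reads are routed to the first reachable region in the list.
    client = CosmosClient(
        "https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
        preferred_locations=["West Europe", "North Europe"],
    )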
74. GraphDB
• Based on Apache TinkerPop (open source)
• Supports the Gremlin & Groovy (how much?) languages
75. GraphDB - possibilities
• Querying across graph collections – not supported right now
• Duplicate edge detection
• Duplicate vertex detection
• Betweenness centrality
• Eigenvector centrality (PageRank)
• Recommendations (like Products in SSAS)
• ...
76. GraphDB Gremlin querying
• g.V().count(); // count all vertices (documents)
• g.V().hasLabel('person').has('age', gt(40)); // people aged over 40
• g.V().hasLabel('person').values('firstName'); // list people's first names
Under the hood, the query
• g.V().hasLabel('Azure')
transforms into
• {"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}
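These queries can be submitted from Python via the gremlinpython driver, the pattern Microsoft documents for the Graph API; the account, database, and graph names below are placeholders:

    # pip install gremlinpython
    from gremlin_python.driver import client, serializer

    gremlin_client = client.Client(
        "wss://myaccount.gremlin.cosmosdb.azure.com:443/",
        "g",
        username="/dbs/graphdb/colls/people",
        password="<account-key>",
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )

    # Submit a raw Gremlin string; the service translates it to SQL as shown above.
    result = gremlin_client.submit("g.V().hasLabel('person').values('firstName')")
    print(result.all().result())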
80. Power BI
• Via Spark – https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuring-Power-BI-Direct-Query-to-Azure-Cosmos-DB-via-Apache-Spark-(HDI)
81. Geospatial
• Working with geospatial and GeoJSON location data in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/geospatial
• Azure Cosmos DB: Expanded geospatial support, including automatic indexing of Polygon and LineString objects: https://azure.microsoft.com/en-us/updates/documentdb-expanded-geospatial-support-including-automatic-indexing-of-polygons-and-lines/
82. CosmosDB Links
• https://www.microsoft.com/en-us/download/details.aspx?id=46436
• Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
• Use the Azure Cosmos DB Emulator for local development and testing: https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator
• Indexing Policies: https://docs.microsoft.com/en-us/azure/cosmos-db/indexing-policies
83. CosmosDB Links
• Gremlin Console: http://tinkerpop.apache.org/docs/current/tutorials/the-gremlin-console/
• Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels