2. Denny Lee
• Principal Program Manager for Azure DocumentDB
• 20+ years of experience in databases, distributed systems, data
sciences, and software development at Microsoft, Concur, and
Databricks
• Notable Projects:
• Project Isotope: Incubation team for HDInsight
• Yahoo! 24TB cube: Largest SSAS cube in production
@dennylee
14. Advantages
Blazing Fast IoT Scenarios
[Diagram: flight information, global safety alerts, weather; data science scenarios, device notifications, Web / REST API]
15. Advantages
Updateable Columns
[Diagram: flight information; data science scenarios, device notifications, Web / REST API]
{
  tripid: "100100",
  delay: -5,
  time: "01:00:01"
}
{
  tripid: "100100",
  delay: -30,
  time: "01:00:01"
}
[Diagram labels, repeated three times: {delay:-30}]
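The two documents on this slide share a tripid and differ only in delay. A minimal sketch of that kind of in-place update, assuming a plain dict as a stand-in for a collection keyed by tripid (the `collection` store and the `upsert` helper are hypothetical, not DocumentDB APIs):

```python
# Hypothetical in-memory stand-in for a document collection keyed by tripid.
collection = {}

def upsert(doc):
    """Insert the document, or merge its fields into the stored one."""
    stored = collection.setdefault(doc["tripid"], {})
    stored.update(doc)
    return stored

# Initial document, as on the slide.
upsert({"tripid": "100100", "delay": -5, "time": "01:00:01"})
# A later partial update changes only the delay field.
upsert({"tripid": "100100", "delay": -30})

print(collection["100100"])
```

Only the delay field changes; the document is not re-inserted and the other fields are left as they were.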
16. Advantages
Pushdown Predicate Filtering
[Diagram: filter {city:SEA} applied to an index tree with nodes locations, headquarter, exports; country (Germany, France, Belgium); city (Seattle, Paris, Moscow, Athens); results feed Data Science Scenarios]
{city:SEA, dst: POR, ...},
{city:SEA, dst: JFK, ...},
{city:SEA, dst: SFO, ...},
{city:SEA, dst: YVR, ...},
{city:SEA, dst: YUL, ...},
...
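To make the idea concrete, here is a small simulation of predicate pushdown, assuming a toy list of flight documents and a dict-based index (all names and sample data are hypothetical; this is not how DocumentDB implements its index). It compares how many documents leave the store with and without the filter being applied at the source:

```python
# Hypothetical flight documents; only the city and dst fields matter here.
FLIGHTS = [
    {"city": "SEA", "dst": "POR"},
    {"city": "SEA", "dst": "JFK"},
    {"city": "SFO", "dst": "SEA"},
    {"city": "YVR", "dst": "YUL"},
]

def build_index(docs, field):
    """Map each value of `field` to the positions of the documents holding it."""
    index = {}
    for pos, doc in enumerate(docs):
        index.setdefault(doc[field], []).append(pos)
    return index

CITY_INDEX = build_index(FLIGHTS, "city")

def query_with_pushdown(city):
    """Source-side filter: only matching documents leave the store."""
    return [FLIGHTS[pos] for pos in CITY_INDEX.get(city, [])]

def query_without_pushdown(city):
    """Client-side filter: every document is transferred, then filtered."""
    transferred = list(FLIGHTS)
    return [d for d in transferred if d["city"] == city], len(transferred)

pushed = query_with_pushdown("SEA")
filtered, moved = query_without_pushdown("SEA")
print(len(pushed), "documents transferred with pushdown;", moved, "without")
```

Both paths return the same rows; the difference is how much data has to cross the wire before the filter is applied.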
17. pyDocumentDB
[Diagram: Spark master node and worker nodes; DocumentDB gateway node and data nodes; pyDocumentDB connection; steps 1-3]
1. Connection is between the Spark master node and the DocumentDB gateway node.
2. Query is submitted from the DocumentDB gateway node to the data nodes. Results are sent back to the gateway node and then transmitted back to the Spark master node.
3. Spark master node converts the dictionary to a DataFrame and distributes it out to the worker nodes.
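The three steps can be sketched roughly as follows. This is a sketch under assumptions, not the code used in the talk: the helper names (`make_client`, `fetch_rows`, `to_dataframe`) are made up for illustration, and the host, key, collection link, and query in the usage comment are placeholders.

```python
def make_client(host, master_key):
    """Step 1: connect to the DocumentDB gateway (requires the pydocumentdb package)."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from pydocumentdb import document_client
    return document_client.DocumentClient(host, {"masterKey": master_key})

def fetch_rows(client, coll_link, query):
    """Step 2: run the query through the gateway and collect all results on the master."""
    return list(client.QueryDocuments(coll_link, query))

def to_dataframe(spark, rows):
    """Step 3: turn the list of dictionaries into a DataFrame for Spark to distribute."""
    return spark.createDataFrame(rows)

# Usage (placeholders, to be run on the Spark master with an active `spark` session):
#   client = make_client("https://<account>.documents.azure.com:443/", "<master key>")
#   rows = fetch_rows(client, "dbs/<database>/colls/<collection>", "SELECT * FROM c")
#   df = to_dataframe(spark, rows)
```

Because every result passes through a single gateway connection and is materialized on the master before Spark sees it, this path is simple but becomes the bottleneck as result sets grow.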
18. Spark to DocumentDB Connector
[Diagram: Spark master node and worker nodes; DocumentDB gateway node and data nodes; Spark-DocumentDB Connector (Java); steps 1-4]
1. Connection is between the Spark master node and the DocumentDB gateway node.
2. Map data is transmitted back.
3. Query is submitted from the Spark worker nodes to the DocumentDB data nodes.
4. Data is transmitted back to the Spark worker nodes for further processing.
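The connector flow (master obtains map data through the gateway, then workers talk to the data nodes directly) can be imitated with a thread pool. This is a toy simulation under assumptions: the `DATA_NODES` dict, the partition ids, and the helper names are hypothetical, and threads stand in for Spark workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data nodes: partition id -> documents stored there.
DATA_NODES = {
    "p0": [{"city": "SEA", "dst": "POR"}, {"city": "SFO", "dst": "SEA"}],
    "p1": [{"city": "SEA", "dst": "JFK"}],
    "p2": [{"city": "YVR", "dst": "YUL"}],
}

def get_partition_map():
    """Master side: ask the (simulated) gateway which partitions exist."""
    return sorted(DATA_NODES)

def worker_read(partition_id, city):
    """Worker side: query one data node directly, filtering at the source."""
    return [d for d in DATA_NODES[partition_id] if d["city"] == city]

def parallel_query(city):
    """Fan the query out, one worker per partition, and keep results per partition."""
    partitions = get_partition_map()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda p: worker_read(p, city), partitions))

result = parallel_query("SEA")
print(result)
```

The results stay split per worker rather than being funneled through one node, which is the property that lets the connector scale with the number of partitions.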
19. Query Test Results
Query                            pyDocumentDB     Azure-DocumentDB-Spark
LIMIT 100                        0:00:00.774820   00:00:01.286
All Seattle flights (23K rows)   0:00:05.146107   00:00:01.582
All flights (~1.39M rows)        0:02:36.335267   00:00:08.899
More info at: https://github.com/Azure/azure-documentdb-spark/wiki/Query-Test-Runs
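As a quick reading aid, the timings above can be turned into speedup ratios (pyDocumentDB time divided by connector time). The helper name `to_seconds` is made up for this sketch; the figures are copied from the table.

```python
def to_seconds(stamp):
    """Convert an h:mm:ss.fraction string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# (pyDocumentDB, Azure-DocumentDB-Spark) timings from the table.
RESULTS = {
    "LIMIT 100": ("0:00:00.774820", "00:00:01.286"),
    "All Seattle flights (23K rows)": ("0:00:05.146107", "00:00:01.582"),
    "All flights (~1.39M rows)": ("0:02:36.335267", "00:00:08.899"),
}

# A ratio above 1 means the connector was faster.
speedups = {
    name: to_seconds(py) / to_seconds(connector)
    for name, (py, connector) in RESULTS.items()
}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers the connector is slower for the tiny LIMIT 100 query (ratio about 0.6), roughly 3.3 times faster for the Seattle flights, and roughly 17.6 times faster for the full scan.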
20. Query Test Results
Issue #   Issue Description
7         Improve push down predicates (e.g. take advantage of TOP/LIMIT, aggregations, etc.)
6         Schema-less query bug
5         Optimize computation push to partitions
3         Add Python wrapper / examples
2         Add Azure-DocumentDB-Spark connector as Spark package
More info at: https://github.com/Azure/azure-documentdb-spark/issues
21. Asks
Go to https://github.com/Azure/azure-documentdb-spark/ and try it out!
References:
• Real-time machine learning on globally-distributed data with Apache
Spark and DocumentDB
• Accelerate real-time big-data analytics with the Spark to DocumentDB
connector
Any questions?
• We’re on StackOverflow #azure-documentdb
• Email askdocdb@ or denny.lee@
25. Graph Calculations: Degrees, PageRank
What is the most important airport (most flights in / out)?
from pyspark.sql.functions import desc
# Top 10 airports by number of inbound flights
(tripGraph.inDegrees
    .sort(desc("inDegree"))
    .limit(10))
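For readers without a GraphFrames setup, the same degree calculation can be sketched in plain Python. The trip list is made-up sample data, and `Counter` stands in for the graph's in-degree and out-degree views:

```python
from collections import Counter

# Hypothetical trips as (origin, destination) pairs.
TRIPS = [
    ("SEA", "SFO"),
    ("JFK", "SFO"),
    ("SFO", "SEA"),
    ("YVR", "SEA"),
    ("SEA", "JFK"),
    ("POR", "SFO"),
]

# In-degree: flights arriving at each airport; out-degree: flights leaving it.
in_degree = Counter(dst for _, dst in TRIPS)
out_degree = Counter(src for src, _ in TRIPS)

# Equivalent of sorting by inDegree descending and keeping the top 10.
top = in_degree.most_common(10)
print(top)
```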
This is module 1 video 2 of the Azure DocumentDB Microsoft Virtual Academy course.
In this video, you'll learn why to use NoSQL and why to choose DocumentDB.
Independently scale storage and throughput. Provisioned throughput guaranteed.
Elastically scale throughput from 100 to tens of millions of requests/sec
Transparent server-side partitioning
Optionally evict old data with TTL
Cheaper than hosted OSS NoSQL databases or DynamoDB
Watch “Predictable performance” module
Write-optimized, SSD-based database engine with low-latency access
Synchronous and automatic indexing at sustained ingestion rates
Globally distributed with reads and writes served from local region
Watch “Predictable performance” module
Scale across any number of Azure regions
Turn-key high availability with transparent failover
Multi-homing
Well-defined consistency models
Watch “Achieve planet scale with DocumentDB: Multi-region replication”
Rich SQL, JavaScript, MongoDB
Multi-model: key-values, column family, or documents
No impedance mismatch - JavaScript is the type system
Write business logic entirely in JavaScript with stored procedures and triggers
Integrated multi-document transactions with snapshot isolation
.NET, Java, Node, Python SDKs
Protocol support for MongoDB. In addition to its current REST interfaces, DocumentDB now supports communication using the MongoDB wire protocol. This means that as a developer you can use existing MongoDB drivers and tools like MongoChef to build applications for DocumentDB.
We’ve released this support today as a preview, with the goal of providing more choice in how you build applications against DocumentDB.
By using existing MongoDB drivers with DocumentDB, your application benefits from the service’s automatic indexing, reliability and availability SLAs.
You can go to the Azure marketplace today and sign up for access to the preview.