Schema Agnostic Indexing with
Azure DocumentDB
@dharmashukla, DocumentDB
Presented at VLDB 2015
Sudipta Sengupta, Justin Levandoski,
David Lomet
Microsoft Research
Dharma Shukla, Shireesh Thota, Karthik Raman,
Madhan Gajendran, Ankur Shah, Sergii Ziuzin,
Krishnan Sundaram, Miguel Gonzalez Guajardo, Anna
Wawrzyniak, Samer Boshra,
Renato Ferreira, Mohamed Nassar,
Michael Koltachev, Ji Huang
Microsoft Corporation
• Overview of DocumentDB
• Schema Agnostic Indexing
• Logical Index Organization
• Physical Index Organization
• Summary
Outline
• Fully managed, multi-tenant, geo-distributed document database service on Azure
• Born out of the needs of internal Microsoft applications; GA since April 2015
• Built from the ground up with resource governance
• Provisioned throughput, performance isolation, OPEX efficiency
• Well defined consistency levels with predictable performance
• Database engine built for JSON & JavaScript
• Automatic indexing of JSON values and rich (SQL and JavaScript) query
• JavaScript language integrated transactions and query directly inside the database engine
What is DocumentDB?
Consistency levels: Strong, Bounded Staleness, Session, Eventual
Architecture
Database
Collection
Document
Account
User
Permission
JavaScript Object Literals
JSON serializable values (aka JSON Infoset)
{
"locations":
[
{ "country": "Germany", "city": "Berlin" },
{ "country": "France", "city": "Paris" }
],
"headquarter": "Belgium",
"exports":[{ "city": "Moscow" },{ "city": "Athens"}]
}
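[Figure: the document above drawn as a tree. The root has children locations, headquarter, and exports; array elements appear as nodes labeled by position (0, 1); property names (country, city) lead to leaf values such as Germany, Berlin, France, Paris, Moscow, Athens, and Belgium.]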
• Automatic indexing of document trees without
requiring schema or secondary indices
• SQL and JavaScript query processing on the trees
• Lazy materialization of JavaScript values from the
instances of trees
JSON document as tree
Schema-agnostic indexing
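The tree view of a JSON document can be made concrete with a small sketch. The Python below is only an illustration, not DocumentDB's engine code: it walks the sample document from this slide and emits one root-to-leaf path per leaf value. The function name `iter_paths` and the choice of `$` as the root label are assumptions made for the example.

```python
import json

def iter_paths(node, prefix=("$",)):
    """Yield one root-to-leaf path (a tuple of labels) per leaf value."""
    if isinstance(node, dict):
        # Property names become interior nodes of the tree.
        for key, child in node.items():
            yield from iter_paths(child, prefix + (key,))
    elif isinstance(node, list):
        # Array elements become nodes labeled by their position.
        for position, child in enumerate(node):
            yield from iter_paths(child, prefix + (str(position),))
    else:
        # Scalars are the leaves.
        yield prefix + (str(node),)

doc = json.loads("""
{
  "locations": [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
""")

paths = ["/".join(path) for path in iter_paths(doc)]
for path in paths:
    print(path)
```

No schema is consulted anywhere in the walk, which is the point of the slide: the structure is discovered from each document instance.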
• Index is a union of all the document trees
Common
structure
• Structural information and instance values are normalized into a
unifying concept of JSON-Path
Terms                   Postings List/Values
$/location/0/           1, 2
location/0/country/     1, 2
location/0/city/        1, 2
0/country/Germany       1, 2
1/country/France        2
…                       …
0/city/Moscow           2
0/dealers/0             2
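A rough sketch of how such a term table could be built follows. It assumes, based only on the shape of the rows above, that each term is a window of three consecutive labels taken from a root-to-leaf path. The helper names (`iter_paths`, `terms_of`, `build_index`) and the document ids 1 and 2 are hypothetical, so the postings it prints need not match the slide row for row.

```python
from collections import defaultdict

def iter_paths(node, prefix=("$",)):
    """Yield one root-to-leaf path (a tuple of labels) per leaf value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for position, child in enumerate(node):
            yield from iter_paths(child, prefix + (str(position),))
    else:
        yield prefix + (str(node),)

def terms_of(path, width=3):
    """Split a path into overlapping windows of `width` labels (assumed term shape)."""
    if len(path) <= width:
        return ["/".join(path)]
    return ["/".join(path[i:i + width]) for i in range(len(path) - width + 1)]

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path in iter_paths(doc):
            for term in terms_of(path):
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: {"locations": [{"country": "Germany", "city": "Berlin"},
                      {"country": "France", "city": "Paris"}],
        "headquarter": "Belgium",
        "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    2: {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
        "headquarter": "Italy",
        "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                    {"city": "Athens"}]},
}

index = build_index(docs)
print(index["0/country/Germany"])
print(index["0/dealers/0"])
```

Because both documents share the term `0/country/Germany`, it is stored once with two postings, which is the "union of all the document trees" idea.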
[Figure: index path segments (nodes labeled 0, Germany, location, country, coordinates) shown for three query classes: range (>, <, !=) and ORDER BY queries, wildcard queries, and spatial queries.]
Dynamic encoding of postings list (E-WAH/differential)
Logical Index Organization
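The slide names two postings encodings, E-WAH and differential, without giving details. The sketch below illustrates only the differential idea in its most common textbook form: store the gaps between sorted document ids and write each gap as a variable-length integer. The function names and the byte layout are assumptions for this example, and E-WAH bitmap compression is not shown.

```python
def encode_postings(doc_ids):
    """Encode ascending, non-negative doc ids as varint-encoded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low bits first; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild the doc ids by summing the gaps."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

postings = [1, 5, 7, 1000, 1003]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
print(len(encoded), decoded)
```

Small gaps fit in a single byte each, so dense postings lists shrink considerably compared with fixed-width ids.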
Query
{
"results":
[
{
"locations":
[
{"country":"Germany","city":"Berlin"},
{"country":"France","city":"Paris"}
]
}
]
}
{ "locations":
[ { "country": "Germany", "city": "Berlin" },
{ "country": "France", "city": "Paris" }
],
"headquarter": "Belgium",
"exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
{ "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 } ],
"headquarter": "Italy",
"exports": [ { "city": "Berlin","dealers": [{"name": "Hans"}] }, { "city": "Athens" }
]
}
[Figure: the two input documents and the query result, each drawn as a tree. The result tree contains only the locations subtree (Germany/Berlin and France/Paris) of the document whose headquarter is Belgium.]
SELECT C.locations
FROM company C
WHERE C.headquarter = "Belgium"
results
Query result
Input documents
function businessLogic() {
    var country = "Belgium";
    __.filter(function (x) { return x.headquarter === country; });
}
SQL JavaScript
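For readers who want to check the query semantics outside the database, the same filter and projection can be sketched in a few lines of Python over the two input documents. This mirrors the result shape shown on the slide; it is not how the engine executes the query, and `select_locations` is a hypothetical helper.

```python
company = [
    {"locations": [{"country": "Germany", "city": "Berlin"},
                   {"country": "France", "city": "Paris"}],
     "headquarter": "Belgium",
     "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
     "headquarter": "Italy",
     "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                 {"city": "Athens"}]},
]

def select_locations(documents, headquarter):
    """Keep documents with the given headquarter and project their locations."""
    return {
        "results": [
            {"locations": doc["locations"]}
            for doc in documents
            if doc.get("headquarter") == headquarter and "locations" in doc
        ]
    }

result = select_locations(company, "Belgium")
print(result)
```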
doc_id =5
key: “age/22”
payload: +doc5
key: “age/21”
payload: -doc5
key: “city/seattle”
payload: +doc5
key: “zip/98103”
payload: +doc5
…
Path/Posting List updates
Index
Query Processor
Indexscan > “age/30”
< “age/32”
doc1, doc5, doc7
System model for writes and queries
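The write and query sides of this model can be sketched together. In the illustration below, a document update arrives as a batch of (key, payload) pairs like those on the slide, and a query is a scan over an ordered key range. The class `OrderedIndex` is hypothetical, a sorted Python list stands in for the B-tree, and plain string comparison is used for key order, which is only adequate for this toy data.

```python
import bisect

class OrderedIndex:
    """Toy ordered index: sorted term keys plus a posting set per key."""

    def __init__(self):
        self.keys = []       # sorted list of term keys
        self.postings = {}   # key -> set of doc ids

    def apply(self, key, payload):
        """Apply one posting update such as '+doc5' or '-doc5'."""
        sign, doc_id = payload[0], payload[1:]
        if sign == "+":
            if key not in self.postings:
                bisect.insort(self.keys, key)
                self.postings[key] = set()
            self.postings[key].add(doc_id)
        else:
            self.postings.get(key, set()).discard(doc_id)

    def scan(self, low, high):
        """Return doc ids for all keys strictly between low and high."""
        start = bisect.bisect_right(self.keys, low)
        stop = bisect.bisect_left(self.keys, high)
        found = set()
        for key in self.keys[start:stop]:
            found |= self.postings[key]
        return sorted(found)

index = OrderedIndex()
index.apply("age/21", "+doc5")   # state before the update shown on the slide

# Update of doc_id 5: age changes from 21 to 22, city and zip are added.
for key, payload in [("age/22", "+doc5"), ("age/21", "-doc5"),
                     ("city/seattle", "+doc5"), ("zip/98103", "+doc5")]:
    index.apply(key, payload)

matches = index.scan("age/20", "age/23")
print(matches)
```

A real index would need an order-preserving encoding for numeric values, since lexicographic order on strings such as `age/9` and `age/10` does not match numeric order.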
B-Tree
Cache
Log Structured Store
Index Maintenance Requirements
• Support sustained volume of rapid writes
without any term locality
• Queries should honor various consistency
levels
• Index maintenance must operate within
frugal resource budget
• Low write, read and space amplification
Page P
Page
ID
Physical
Address
P
Mapping Table
Δ: Insert record 50
Δ: Delete record 48
Δ: Update record 35 Δ: Insert record 60
Consolidated Page P
Update record 35 Insert record 60
Highly concurrent index updates
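The delta-update scheme on this slide can be sketched as a chain of delta records hanging off a mapping table entry. The Python below is a single-threaded illustration under stated assumptions: the class and function names are invented, and the plain dictionary assignment stands in for the atomic compare-and-swap that a latch-free implementation would use to install a new chain head.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class BasePage:
    records: dict

@dataclass
class Delta:
    op: str                 # "insert", "update", or "delete"
    key: int
    value: Optional[Any]
    next_node: Any          # older delta or the base page

# Mapping table: page id -> current head of the page's delta chain.
mapping_table = {"P": BasePage({35: "v1", 48: "v1"})}

def prepend_delta(page_id, op, key, value=None):
    """Publish an update without modifying the existing page in place."""
    # Assumption: a real system would use compare-and-swap on this slot.
    mapping_table[page_id] = Delta(op, key, value, mapping_table[page_id])

def consolidate(page_id):
    """Fold the delta chain into a new base page and install it."""
    node, deltas = mapping_table[page_id], []
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next_node
    records = dict(node.records)
    for delta in reversed(deltas):      # replay oldest delta first
        if delta.op == "delete":
            records.pop(delta.key, None)
        else:
            records[delta.key] = delta.value
    mapping_table[page_id] = BasePage(records)
    return records

prepend_delta("P", "insert", 50, "v1")
prepend_delta("P", "delete", 48)
prepend_delta("P", "update", 35, "v2")
prepend_delta("P", "insert", 60, "v1")
consolidated = consolidate("P")
print(consolidated)
```

Writers only ever add a small record at the head of the chain, so concurrent updates to the same page never need to latch the page itself.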
[Figure: in RAM, the mapping table and a latch-free flush buffer (8 MB); on SSD, the log-structured store holding base pages and Δ-records in log write order.]
Write optimized storage organization
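A minimal sketch of this storage organization is given below, assuming a single writer. Records are staged in an in-memory flush buffer and written to the log in one large sequential append when the buffer fills, while a mapping table remembers where the latest record for each page lives. The class `LogStructuredStore`, its fields, and the tiny 16-byte buffer used in the demo are all assumptions for illustration; the slide's buffer is 8 MB and latch-free.

```python
class LogStructuredStore:
    """Toy append-only store with an in-memory flush buffer."""

    def __init__(self, buffer_capacity=8 * 1024 * 1024):
        self.log = bytearray()        # stands in for the log on SSD
        self.buffer = bytearray()     # in-RAM flush buffer
        self.buffer_capacity = buffer_capacity
        self.mapping_table = {}       # page id -> (offset, length) of latest record
        self.flushes = 0

    def append(self, page_id, record):
        """Stage a record; flush first if it would overflow the buffer."""
        if len(self.buffer) + len(record) > self.buffer_capacity:
            self.flush()
        offset = len(self.log) + len(self.buffer)   # logical log address
        self.buffer += record
        self.mapping_table[page_id] = (offset, len(record))

    def flush(self):
        """Write the whole buffer to the log as one sequential append."""
        if self.buffer:
            self.log += self.buffer
            self.buffer.clear()
            self.flushes += 1

    def read(self, page_id):
        """Read the latest record for a page from the log or the buffer."""
        offset, length = self.mapping_table[page_id]
        data = self.log + self.buffer
        return bytes(data[offset:offset + length])

store = LogStructuredStore(buffer_capacity=16)
store.append("P", b"base-page-P")   # 11 bytes, stays in the buffer
store.append("Q", b"delta-1")       # would overflow, so P is flushed first
print(store.flushes, store.mapping_table)
```

Batching many small records into one large write is what keeps write amplification low on this path.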
• Little to no term locality on index write path
• Unable to keep “hot set” of leaf pages
cached in memory
• Performing read to modify each leaf node
leads to very high I/O overhead
• Requires method to maintain efficient write
path for sustained term ingestion with
predictable performance
update term t1
delete term t58
insert term t109
update term t179
update term t568
delete term t732
Lack of term locality
Address
Mapping Table
Log Structured Store (LSS)
T → {doc1, doc2, doc3, doc5}
Term T → -doc2
P
Read I/O
Page Stub
Address
Mapping Table
Log Structured Store (LSS)
Term T → +doc5
P
T → +doc2 T → -doc2
Page Stub
{doc1, doc2, doc3} {+doc5} {-doc2}
Term lookup or full
page consolidate
Page P
T → {doc1, doc2, doc3}
Add doc5 to posting list for term T
Page P
T → {doc1, doc2, doc3}
Page P
T → {doc1, doc2, doc3}
…
Consolidated Page P
T → {doc1, doc3, doc5}
Blind update for term T
Blind updates and value merge
[Figure: number of I/Os (y-axis, 0 to 18000) versus index size in MB (x-axis, 0 to 10000), comparing Update with Blind Update.]
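The blind update and value merge idea can be sketched as follows. The point being illustrated is that a write appends a small delta for a term without first reading the page that holds its posting list, and the deltas are merged into the stored value only when the term is looked up or the page is consolidated. The class `BlindUpdateIndex` and its read counter are hypothetical and exist only to make the I/O difference visible.

```python
class BlindUpdateIndex:
    """Toy index where writes never read the stored posting list."""

    def __init__(self):
        self.stored = {}     # term -> base posting set (stands in for pages on the LSS)
        self.pending = {}    # term -> list of (sign, doc_id) deltas
        self.read_ios = 0    # counts simulated reads of stored pages

    def blind_update(self, term, sign, doc_id):
        """Record '+doc' or '-doc' for a term without touching stored data."""
        self.pending.setdefault(term, []).append((sign, doc_id))

    def lookup(self, term):
        """Read the stored postings once and merge pending deltas in order."""
        self.read_ios += 1
        merged = set(self.stored.get(term, ()))
        for sign, doc_id in self.pending.pop(term, []):
            if sign == "+":
                merged.add(doc_id)
            else:
                merged.discard(doc_id)
        self.stored[term] = merged      # consolidated value
        return sorted(merged)

index = BlindUpdateIndex()
index.stored["T"] = {"doc1", "doc2", "doc3"}

index.blind_update("T", "+", "doc5")
index.blind_update("T", "-", "doc2")
reads_after_writes = index.read_ios

merged = index.lookup("T")
print(reads_after_writes, merged)
```

In this sketch the two writes cost no reads at all, and a single read at lookup time pays for the merge.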
Summary
