Automated Schema Design for
NoSQL Databases
Michael J. Mior
Supervised by Kenneth Salem
Schema optimization
For a given workload, construct the optimal
physical schema to answer queries in the
workload for a given storage budget
What is NoSQL?
● Not SQL?
● NoACID?
● Not only SQL
Types of NoSQL Databases
● Document stores
● Key-value stores
● Graph databases
● …
● Wide column stores
Cassandra data model
Cassandra data model
Queries specify
A set of row keys
and either
A list of column keys
or
A prefix of the column keys with a range
Cassandra schema design
Physical schema
POIsWithHotels
POI17
Hotel3
“Holiday Inn”
…
…
No single obvious mapping from the conceptual schema
Conceptual schema
Hotel
PointOfInterest
Near
Hotel8
“Motel 6”
Cassandra schema design
POIsWithHotels
POI17
Hotel3
“Holiday Inn”
…
…
HotelsWithPOIs
Hotel3
POI17
“Liberty Bell”
…
…
● Multiple choices of logical and
physical schema are possible
for a given conceptual schema
● Different choices are more
suitable for different queries
● Harder to add additional
structures in the future
Cassandra schema design
POIs
POI17
Name
“Liberty Bell”
…
…
Hotels
Hotel3
Name
“Holiday Inn”
…
…
HotelToPOI
Hotel3
POI17
(null)
…
…
POIToHotel
POI17
Hotel3
(null)
…
…
Cassandra schema design
POIToHotel
POI17
Hotel3
“Holiday Inn”
…
…
HotelToPOI
Hotel3
POI17
“Liberty Bell”
…
…
HotelToPOI
HotelID
POIID
(null)
…
…
POIToHotel
POIID
HotelID
(null)
…
…
Challenges
● Storage costs
● View maintenance
● Transactions
● Data distribution
Relational schema design
● Commercial database vendors have existing tools
○ Microsoft AutoAdmin for SQL Server
○ DB2 Design Advisor
○ Oracle SQL Tuning Advisor
● These tools don’t solve all the problems for NoSQL databases
● Usually an existing schema is assumed, and tools suggest secondary
indices and materialized views
Relational schema design
Conceptual schema Logical schema Physical schema
Hotel
CREATE TABLE Hotel(…);
CREATE TABLE POI(…);
CREATE TABLE HotelToPOI(…,
FOREIGN KEY (HotelID)
REFERENCES Hotel(HotelID)
FOREIGN KEY (POOID)
REFERENCES POI(POIID)
);
CREATE MATERIALIZED VIEW
POIsWithHotels AS SELECT
POI.*, Hotel.* FROM
POI, HotelToPOI, Hotel
WHERE HotelToPOI.HotelID
= Hotel.HotelID
AND HotelToPOI.POIID
= POI.POIID;
Mapping from conceptual to logical schema is straightforward
Near
PointOfInterest
Cassandra schema design
Physical schema
POIsWithHotels
POI17
Hotel3
“Holiday Inn”
…
…
No single obvious mapping from the conceptual schema
Conceptual schema
Hotel
Near
Hotel8
“Motel 6”
PointOfInterest
Abstract example
Query paths
Get all points of interests near
hotels a guest has stayed at
SELECT Name FROM POI WHERE
POI.HotelToPOI
.Hotel.Room.Reservation
.Guest.GuestID = ?
Guest Reservation
Room
Hotel
Point of
Interest
Index paths
● Queries can “skip” entities
on a path using an index
● Indices to the final entity
are equivalent to
materialized views
● Schema design is now the
selection of these indices
Guest Reservation
Room
Hotel
Point of
Interest
Problem definition
Input:
● ER diagram with entity counts and cardinalities
● Weighted queries over the diagram
● Storage constraint
● Cost model for target DB
Output:
● Suggested index configuration
● Plans for executing each query
Proposed solution
1. Enumerate possible useful indices for all queries
2. Determine the benefit of a given index for each query
3. Select the optimal indices for a given space constraint
Future Work
● Retargeting to other NoSQL DBs
● Automatic implementation of query plans
● Handling workloads which include updates
● Richer query language
● Exploitation of DBMS-specific features
Summary
● Schema design in NoSQL databases
presents unique challenges
● A workload-driven approach is necessary
● Conceptual schemata and queries are viable
abstractions for NoSQL schema design tools
Query language
SELECT [attributes] FROM [entity]
WHERE [path](=|<|<=|>|>=) [value] AND …
ORDER BY [path] LIMIT [count]
Conjunctive queries with simple predicates and ordering
Higher level queries can be composed by the application
Solution
Index
Enumerator
Query Planner
Index Selection
Index benefit
analysis
Query Plans
Selected IndicesWorkload
Input Output
Implementation
● Focus on Cassandra as target system
● Simple cost model for read-only in-memory workloads
● DBMS-independent index enumerator and query planner
● Index selection ILP implemented with GLPK
Evaluation - TBD
● Re-implement existing workloads in the target DBMS (e.
g. RUBiS, RUBBoS)
● Develop new benchmarks
● Comparison with a human DBA
● Random ER diagrams/queries with realistic properties

Automated Schema Design for NoSQL Databases

  • 1.
    Automated Schema Designfor NoSQL Databases Michael J. Mior Supervised by Kenneth Salem
  • 2.
    Schema optimization For agiven workload, construct the optimal physical schema to answer queries in the workload for a given storage budget
  • 3.
    What is NoSQL? ●Not SQL? ● NoACID? ● Not only SQL
  • 4.
    Types of NoSQLDatabases ● Document stores ● Key-value stores ● Graph databases ● … ● Wide column stores
  • 5.
  • 6.
    Cassandra data model Queriesspecify A set of row keys and either A list of column keys or A prefix of the column keys with a range
  • 7.
    Cassandra schema design Physicalschema POIsWithHotels POI17 Hotel3 “Holiday Inn” … … No single obvious mapping from the conceptual schema Conceptual schema Hotel PointOfInterest Near Hotel8 “Motel 6”
  • 8.
    Cassandra schema design POIsWithHotels POI17 Hotel3 “HolidayInn” … … HotelsWithPOIs Hotel3 POI17 “Liberty Bell” … … ● Multiple choices of logical and physical schema are possible for a given conceptual schema ● Different choices are more suitable for different queries ● Harder to add additional structures in the future
  • 9.
    Cassandra schema design POIs POI17 Name “LibertyBell” … … Hotels Hotel3 Name “Holiday Inn” … … HotelToPOI Hotel3 POI17 (null) … … POIToHotel POI17 Hotel3 (null) … …
  • 10.
    Cassandra schema design POIToHotel POI17 Hotel3 “HolidayInn” … … HotelToPOI Hotel3 POI17 “Liberty Bell” … … HotelToPOI HotelID POIID (null) … … POIToHotel POIID HotelID (null) … …
  • 11.
    Challenges ● Storage costs ●View maintenance ● Transactions ● Data distribution
  • 12.
    Relational schema design ●Commercial database vendors have existing tools ○ Microsoft AutoAdmin for SQL Server ○ DB2 Design Advisor ○ Oracle SQL Tuning Advisor ● These tools don’t solve all the problems for NoSQL databases ● Usually an existing schema is assumed, and tools suggest secondary indices and materialized views
  • 13.
    Relational schema design Conceptualschema Logical schema Physical schema Hotel CREATE TABLE Hotel(…); CREATE TABLE POI(…); CREATE TABLE HotelToPOI(…, FOREIGN KEY (HotelID) REFERENCES Hotel(HotelID) FOREIGN KEY (POOID) REFERENCES POI(POIID) ); CREATE MATERIALIZED VIEW POIsWithHotels AS SELECT POI.*, Hotel.* FROM POI, HotelToPOI, Hotel WHERE HotelToPOI.HotelID = Hotel.HotelID AND HotelToPOI.POIID = POI.POIID; Mapping from conceptual to logical schema is straightforward Near PointOfInterest
  • 14.
    Cassandra schema design Physicalschema POIsWithHotels POI17 Hotel3 “Holiday Inn” … … No single obvious mapping from the conceptual schema Conceptual schema Hotel Near Hotel8 “Motel 6” PointOfInterest
  • 15.
  • 16.
    Query paths Get allpoints of interests near hotels a guest has stayed at SELECT Name FROM POI WHERE POI.HotelToPOI .Hotel.Room.Reservation .Guest.GuestID = ? Guest Reservation Room Hotel Point of Interest
  • 17.
    Index paths ● Queriescan “skip” entities on a path using an index ● Indices to the final entity are equivalent to materialized views ● Schema design is now the selection of these indices Guest Reservation Room Hotel Point of Interest
  • 18.
    Problem definition Input: ● ERdiagram with entity counts and cardinalities ● Weighted queries over the diagram ● Storage constraint ● Cost model for target DB Output: ● Suggested index configuration ● Plans for executing each query
  • 19.
    Proposed solution 1. Enumeratepossible useful indices for all queries 2. Determine the benefit of a given index for each query 3. Select the optimal indices for a given space constraint
  • 20.
    Future Work ● Retargetingto other NoSQL DBs ● Automatic implementation of query plans ● Handling workloads which include updates ● Richer query language ● Exploitation of DBMS-specific features
  • 21.
    Summary ● Schema designin NoSQL databases presents unique challenges ● A workload-driven approach is necessary ● Conceptual schemata and queries are viable abstractions for NoSQL schema design tools
  • 22.
    Query language SELECT [attributes]FROM [entity] WHERE [path](=|<|<=|>|>=) [value] AND … ORDER BY [path] LIMIT [count] Conjunctive queries with simple predicates and ordering Higher level queries can be composed by the application
  • 23.
    Solution Index Enumerator Query Planner Index Selection Indexbenefit analysis Query Plans Selected IndicesWorkload Input Output
  • 24.
    Implementation ● Focus onCassandra as target system ● Simple cost model for read-only in-memory workloads ● DBMS-independent index enumerator and query planner ● Index selection ILP implemented with GLPK
  • 25.
    Evaluation - TBD ●Re-implement existing workloads in the target DBMS (e. g. RUBiS, RUBBoS) ● Develop new benchmarks ● Comparison with a human DBA ● Random ER diagrams/queries with realistic properties