Inferring Versioned
Schemas from NoSQL
Databases and its
Applications
ER’15
Stockholm, October 2015
[{ ”id”: ”90234 af”, ”value”: { ”author”: ”Diego Sevilla Ruiz”,
”e-mail”: ”dsevilla@um.es”,
”institution”: ”U. of Murcia”}},
{ ”id”: ”a243bb5”, ”value”: { ”author”: ”Severino Feliciano Morales”,
”e-mail”: ”severino.feliciano@um.es”,
”institution”: ”U. of Murcia”}},
{ ”id”: ”096705d”, ”value”: { ”author”: ”Jesús García Molina”,
”e-mail”: ”jmolina@um.es”,
”institution”: ”U. of Murcia”}}]
Motivation
NoSQL Databases are Schemaless
Benefits
▶ No need to previously
define an Schema
▶ Non-uniform data
▶ Custom fields
▶ Non-uniform types
▶ Easier evolution
Drawbacks
▶ Harder to reason about
the DB
▶ Static checking is lost
▶ Some of the data logic is
in the application code
(more error prone)
▶ Some utilities need
Schema information to
work
Schemas for NoSQL Databases
▶ How to alleviate the problems of schemaless
databases? ⇒ Inferring a Schema
▶ The Schema Model contains information about
Entities and Relationships
▶ Take into account the different Entity Versions in
the Database
▶ Heterogeneity usually because of slight variations on
Entities
▶ We obtain a precise database model
▶ The Schema allows us to automate the construction
of tools:
▶ migration, refactoring, visualization, …
Related Work
▶ JSON Schema
▶ Object versions and relationships are not considered
▶ Apache Spark SQL/Drill: SQL-like schemas
▶ Union of all fields, nullable ⇒ incorrect combinations
▶ Over-generalization to String
▶ Aggregations and Reference relations not considered
▶ MongoDB-Schema
▶ Prototype to infer schemas from MongoDB
collections
▶ Same limitations than Spark SQL
▶ JSON Discoverer
▶ A MDE solution to infer domain models from REST
web services (i.e. JSON documents)
▶ Not database-oriented; Object versions not
considered
Spark SQL Example
{”name”:”Michael”}
{”name”:”Andy”, ”age”:30}
{”name”:”Justin”, ”age”:19}
{”name”:”Peter”, ”age”:”tiny”}
{”name”:”Martina”, ”address”:”home!”}
> people.printSchema
root
|-- address: string (nullable = true)
|-- age: string (nullable = true)
|-- name: string (nullable = true)
▶ age promoted to string
▶ age and address are never part of the same object
{
”rows”:[
{
”content”:{
”chapters”:33,
”pages”:527
},
”authors”:[
{
”company”:{
”country”:”USA”,
”name”:”IBM”
},
”name”:”Grady Booch”,
”_id”:”210”
},
{
”company”:{
”country”:”USA”,
”name”:”IBM”
},
”name”:”James Rumbaugh”,
”_id”:”310”
},
{
”country”:”USA”,
”company”:”Ivar Jacobson Consulting”,
”name”:”Ivar Jacobson”,
”_id”:”410”
}],
”type”:”book”,
”year”:2013,
”publisher_id”:”345679”,
”title”:”The Unified Modeling Language”,
”_id”:”1”
},
{
”discipline”:”software engineering”,
”issn”:[
”0098 -5589”,
”1939 -3520”
],
”name”:”IEEE Trans. on Software Engineering”,
”type”:”journal”,
”_id”:”11”
},
{
”name”:”Automated Software Engineering”,
”issn”:[
”0928 -8910”,
”1573 -7535”
],
”discipline”:”software engineering”,
”type”:”journal”,
”_id”:”12”,
”number”:10515
},
{
”city”:”Barcelona”,
”name”:”Omega”,
”type”:”publisher”,
”_id”:”123451”
},
{
”type”:”publisher”,
”city”:”Newton”,
”name”:”O’Reilly Media”,
”_id”:”928672”
},
{
”type”:”book”,
”author”:{
”_id”:”101”,
”name”:”Bradley Holt”,
”company”:{
”country”:”USA”,
”name”:”IBM Cloudant”,
}
},
”title”:”Writing and Querying MapReduce Views in
CouchDB”,
”publisher_id”:”928672”,
”_id”:”2”
},
{
”name”:”Addison -Wesley”,
”type”:”publisher”,
”_id”:”345679”
},
{
”type”:”publisher”,
”journals”:[
”11”,
”12”
],
”name”:”IEEE Publications”,
”_id”:”907863”
}]}
NoSQL Database Model
▶ Objects (Entities) and Entity Versions
▶ Attributes
▶ Relationships
▶ Aggregation
▶ References
{
”type”:”publisher”,
”city”:”Newton”,
”name”:”O’Reilly Media”,
”_id”:”928672”
},
{
”type”:”book”,
”author”:{
”_id”:”101”,
”name”:”Bradley Holt”,
”company”:{
”country”:”USA”,
”name”:”IBM Cloudant”,
}
},
”title”:”Writing and Querying MapReduce Views in CouchDB”,
”publisher_id”:”928672”,
”_id”:”2”
},
Schema & Entity Versions Description
Entity Publisher {
Version 1 {
name: String
city: String
}
Version 2 {
name: String
}
Version 3 {
name: String
journal[+]: [Ref]->[Journal] (opposite=False)
}
}
Entity Journal {
Version 1 {
issn: Tuple [String, String]
name: String
discipline: String
}
Version 2 {
issn: Tuple [String, String]
name: String
discipline: String
number: int
}
}
Entity Book {
Version 1 {
title: String
year: int
publisher[1]: [Ref]->[Publisher] (opossite=False)
content[1]: [Aggregate]Content1
author[+]: [Aggregate]Author1
}
Version 2 {
title: String
publisher[1]: [Ref]->[Publisher] (opossite=False)
author[1]: [Aggregate]Author1
}
}
Entity Author {
Version 1 {
name: String
company[1]: [Aggregate]Company
}
Version 2 {
country: String
company: String
name: String
}
}
Entity Company {
Version 1 {
name: String
country: String
}
}
Entity Content {
Version 1 {
chapters: int
pages: int
}
}
(a) (b)
[1..1] company
[1..1] publisher[1..1] content[1..*] authors
[1..*] journals
Solution Design Considerations
▶ We have to process all the objects in the Database
⇒ Map-Reduce
▶ Natural data processing on NoSQL databases
▶ Leverage MDE technologies
▶ Reuse EMF/Ecore tooling to show entity diagrams
▶ Automation & Code Generation by Metamodeling &
Model Transformations
Proposed MDE Architecture
NoSQL
Database
MapReduce
Object
Versions
(JSON)
JSON
Injection
JSON
Model
JSON
Metamodel
Schema
Reverse
Eng
Schema
Model
Application
Generation
Schema
Viewer/
Data
Validator/
Migration
Assistant
Applications Schema
Metamodel
instance
instance
Reverse Engineering Process (i)
▶ Map-Reduce process
▶ Map: obtains the Raw Schema for each object
▶ Reduce: selects an archetype for each Entity Version
▶ Entity Type
▶ Root objects ⇒ “type” field or collection name
▶ Aggregated objects ⇒ key of the pair (e.g. “author”)
JSON object Raw Schema
{name:“Omega”, city:“Barcelona”} {name:String, city:String}
{title:“Writing and...”,
publisher_id:“928672”,
author:{name:“Bradley Holt”,
company:{country:“USA”,
name:“IBM Cloudant”} } }
{title:String,
publisher_id:String,
author:{name:String,
company:{country:String,
name:String} } }
Reverse Engineering Process (ii)
▶ Attributes: primitive or tuple
▶ Aggregated Entities
▶ Value of the pair is an Object (or array of objects)
▶ Entity type inferred from the key
▶ References
▶ Heuristics/Conventions
▶ Key: <entity_name>_id
▶ Value: MongoDB’s DBRef abstraction:
{”$ref”: ”<entity_name>”, ”$id”, <id_value>}
▶ Honor cardinalities (arrays)
Example NoSQL Applications
▶ From the DBSchema model, using Model
Transformations and Model-to-Text transformations
(Code Generation), we can:
▶ Generate models that Characterize each Entity
Version
▶ That characterization can be used to Visualize the
Database
▶ And also to generate code to Validate objects
entering the Database
▶ Generate models that allow Database Migration to
the desired Entity Versions
Type Discrimination/Characterization
Metamodel
function isOfExactTypeBook_2(obj) {
if (! (”type” in obj)) {
return false;
}
if (obj[type] !== ”Book”) {
return false;
}
if (! (”title” in obj)) {
return false;
}
if (! (”author” in obj)) {
return false;
}
if (”publisher” in obj) {
return false;
}
if (”content” in obj) {
return false;
}
if (”year” in obj) {
return false;
}
return true;
}
Generated using a Model-
to-Text transformation
from an instance of the
previous Type Discrimina-
tion Metamodel
Entity Versions
Alternate: D3.js Treemap
Type Transformation Metamodel
db.<collection >. update(
<query >,
<update >,
{
multi: true
}
)
Obtained by Entity Type Characterization
Generate the correct update
MongoDB statement using $set,
$push, etc., maybe via user assis-
tance through a DSL.
For example, for Journal_1 to
Journal_2:
$set: { ”number”: 1 }
Conclusions & Future work
▶ A process for obtaining Conceptual Model Schemas
for NoSQL Databases is shown
▶ The process takes into account the different Entity
Versions present in the Database
▶ A MDE process allows us to automate the
production of several applications from the Schemas
▶ Example applications that allow Database
Visualization and Migration are shown
Conclusions & Future work (ii)
▶ Future work includes:
▶ Building a NoSQL Database Tool Set (NoSQL Data
Engineering)
▶ DSL for Entity Version migration
▶ Refining the Schema to allow a richer Type System
▶ Allow value ranges or enumerated sets
▶ Infer attribute dependencies (derived attributes,
i.e. the value of an attribute dictates the value of
another attribute)
▶ etc.

Inferring Versioned Schemas from NoSQL Databases and its Applications

  • 1.
    Inferring Versioned Schemas fromNoSQL Databases and its Applications ER’15 Stockholm, October 2015 [{ ”id”: ”90234 af”, ”value”: { ”author”: ”Diego Sevilla Ruiz”, ”e-mail”: ”dsevilla@um.es”, ”institution”: ”U. of Murcia”}}, { ”id”: ”a243bb5”, ”value”: { ”author”: ”Severino Feliciano Morales”, ”e-mail”: ”severino.feliciano@um.es”, ”institution”: ”U. of Murcia”}}, { ”id”: ”096705d”, ”value”: { ”author”: ”Jesús García Molina”, ”e-mail”: ”jmolina@um.es”, ”institution”: ”U. of Murcia”}}]
  • 2.
    Motivation NoSQL Databases areSchemaless Benefits ▶ No need to previously define an Schema ▶ Non-uniform data ▶ Custom fields ▶ Non-uniform types ▶ Easier evolution Drawbacks ▶ Harder to reason about the DB ▶ Static checking is lost ▶ Some of the data logic is in the application code (more error prone) ▶ Some utilities need Schema information to work
  • 3.
    Schemas for NoSQLDatabases ▶ How to alleviate the problems of schemaless databases? ⇒ Inferring a Schema ▶ The Schema Model contains information about Entities and Relationships ▶ Take into account the different Entity Versions in the Database ▶ Heterogeneity usually because of slight variations on Entities ▶ We obtain a precise database model ▶ The Schema allows us to automate the construction of tools: ▶ migration, refactoring, visualization, …
  • 4.
    Related Work ▶ JSONSchema ▶ Object versions and relationships are not considered ▶ Apache Spark SQL/Drill: SQL-like schemas ▶ Union of all fields, nullable ⇒ incorrect combinations ▶ Over-generalization to String ▶ Aggregations and Reference relations not considered ▶ MongoDB-Schema ▶ Prototype to infer schemas from MongoDB collections ▶ Same limitations than Spark SQL ▶ JSON Discoverer ▶ A MDE solution to infer domain models from REST web services (i.e. JSON documents) ▶ Not database-oriented; Object versions not considered
  • 5.
    Spark SQL Example {”name”:”Michael”} {”name”:”Andy”,”age”:30} {”name”:”Justin”, ”age”:19} {”name”:”Peter”, ”age”:”tiny”} {”name”:”Martina”, ”address”:”home!”} > people.printSchema root |-- address: string (nullable = true) |-- age: string (nullable = true) |-- name: string (nullable = true) ▶ age promoted to string ▶ age and address are never part of the same object
  • 6.
    { ”rows”:[ { ”content”:{ ”chapters”:33, ”pages”:527 }, ”authors”:[ { ”company”:{ ”country”:”USA”, ”name”:”IBM” }, ”name”:”Grady Booch”, ”_id”:”210” }, { ”company”:{ ”country”:”USA”, ”name”:”IBM” }, ”name”:”James Rumbaugh”, ”_id”:”310” }, { ”country”:”USA”, ”company”:”IvarJacobson Consulting”, ”name”:”Ivar Jacobson”, ”_id”:”410” }], ”type”:”book”, ”year”:2013, ”publisher_id”:”345679”, ”title”:”The Unified Modeling Language”, ”_id”:”1” }, { ”discipline”:”software engineering”, ”issn”:[ ”0098 -5589”, ”1939 -3520” ], ”name”:”IEEE Trans. on Software Engineering”, ”type”:”journal”, ”_id”:”11” }, { ”name”:”Automated Software Engineering”, ”issn”:[ ”0928 -8910”, ”1573 -7535” ], ”discipline”:”software engineering”, ”type”:”journal”, ”_id”:”12”, ”number”:10515 }, { ”city”:”Barcelona”, ”name”:”Omega”, ”type”:”publisher”, ”_id”:”123451” }, { ”type”:”publisher”, ”city”:”Newton”, ”name”:”O’Reilly Media”, ”_id”:”928672” }, { ”type”:”book”, ”author”:{ ”_id”:”101”, ”name”:”Bradley Holt”, ”company”:{ ”country”:”USA”, ”name”:”IBM Cloudant”, } }, ”title”:”Writing and Querying MapReduce Views in CouchDB”, ”publisher_id”:”928672”, ”_id”:”2” }, { ”name”:”Addison -Wesley”, ”type”:”publisher”, ”_id”:”345679” }, { ”type”:”publisher”, ”journals”:[ ”11”, ”12” ], ”name”:”IEEE Publications”, ”_id”:”907863” }]}
  • 7.
    NoSQL Database Model ▶Objects (Entities) and Entity Versions ▶ Attributes ▶ Relationships ▶ Aggregation ▶ References { ”type”:”publisher”, ”city”:”Newton”, ”name”:”O’Reilly Media”, ”_id”:”928672” }, { ”type”:”book”, ”author”:{ ”_id”:”101”, ”name”:”Bradley Holt”, ”company”:{ ”country”:”USA”, ”name”:”IBM Cloudant”, } }, ”title”:”Writing and Querying MapReduce Views in CouchDB”, ”publisher_id”:”928672”, ”_id”:”2” },
  • 8.
    Schema & EntityVersions Description Entity Publisher { Version 1 { name: String city: String } Version 2 { name: String } Version 3 { name: String journal[+]: [Ref]->[Journal] (opposite=False) } } Entity Journal { Version 1 { issn: Tuple [String, String] name: String discipline: String } Version 2 { issn: Tuple [String, String] name: String discipline: String number: int } } Entity Book { Version 1 { title: String year: int publisher[1]: [Ref]->[Publisher] (opossite=False) content[1]: [Aggregate]Content1 author[+]: [Aggregate]Author1 } Version 2 { title: String publisher[1]: [Ref]->[Publisher] (opossite=False) author[1]: [Aggregate]Author1 } } Entity Author { Version 1 { name: String company[1]: [Aggregate]Company } Version 2 { country: String company: String name: String } } Entity Company { Version 1 { name: String country: String } } Entity Content { Version 1 { chapters: int pages: int } } (a) (b) [1..1] company [1..1] publisher[1..1] content[1..*] authors [1..*] journals
  • 9.
    Solution Design Considerations ▶We have to process all the objects in the Database ⇒ Map-Reduce ▶ Natural data processing on NoSQL databases ▶ Leverage MDE technologies ▶ Reuse EMF/Ecore tooling to show entity diagrams ▶ Automation & Code Generation by Metamodeling & Model Transformations
  • 10.
  • 11.
    Reverse Engineering Process(i) ▶ Map-Reduce process ▶ Map: obtains the Raw Schema for each object ▶ Reduce: selects an archetype for each Entity Version ▶ Entity Type ▶ Root objects ⇒ “type” field or collection name ▶ Aggregated objects ⇒ key of the pair (e.g. “author”) JSON object Raw Schema {name:“Omega”, city:“Barcelona”} {name:String, city:String} {title:“Writing and...”, publisher_id:“928672”, author:{name:“Bradley Holt”, company:{country:“USA”, name:“IBM Cloudant”} } } {title:String, publisher_id:String, author:{name:String, company:{country:String, name:String} } }
  • 12.
    Reverse Engineering Process(ii) ▶ Attributes: primitive or tuple ▶ Aggregated Entities ▶ Value of the pair is an Object (or array of objects) ▶ Entity type inferred from the key ▶ References ▶ Heuristics/Conventions ▶ Key: <entity_name>_id ▶ Value: MongoDB’s DBRef abstraction: {”$ref”: ”<entity_name>”, ”$id”, <id_value>} ▶ Honor cardinalities (arrays)
  • 14.
    Example NoSQL Applications ▶From the DBSchema model, using Model Transformations and Model-to-Text transformations (Code Generation), we can: ▶ Generate models that Characterize each Entity Version ▶ That characterization can be used to Visualize the Database ▶ And also to generate code to Validate objects entering the Database ▶ Generate models that allow Database Migration to the desired Entity Versions
  • 15.
  • 16.
    function isOfExactTypeBook_2(obj) { if(! (”type” in obj)) { return false; } if (obj[type] !== ”Book”) { return false; } if (! (”title” in obj)) { return false; } if (! (”author” in obj)) { return false; } if (”publisher” in obj) { return false; } if (”content” in obj) { return false; } if (”year” in obj) { return false; } return true; } Generated using a Model- to-Text transformation from an instance of the previous Type Discrimina- tion Metamodel
  • 19.
  • 20.
  • 21.
    Type Transformation Metamodel db.<collection>. update( <query >, <update >, { multi: true } ) Obtained by Entity Type Characterization Generate the correct update MongoDB statement using $set, $push, etc., maybe via user assis- tance through a DSL. For example, for Journal_1 to Journal_2: $set: { ”number”: 1 }
  • 22.
    Conclusions & Futurework ▶ A process for obtaining Conceptual Model Schemas for NoSQL Databases is shown ▶ The process takes into account the different Entity Versions present in the Database ▶ A MDE process allows us to automate the production of several applications from the Schemas ▶ Example applications that allow Database Visualization and Migration are shown
  • 23.
    Conclusions & Futurework (ii) ▶ Future work includes: ▶ Building a NoSQL Database Tool Set (NoSQL Data Engineering) ▶ DSL for Entity Version migration ▶ Refining the Schema to allow a richer Type System ▶ Allow value ranges or enumerated sets ▶ Infer attribute dependencies (derived attributes, i.e. the value of an attribute dictates the value of another attribute) ▶ etc.