This document provides an overview of MetaQL, which allows composing queries across NoSQL, SQL, SPARQL, and Spark databases using a domain model. Key points include:
- MetaQL uses a domain model to define concepts and compose typed queries in code that can execute across different databases.
- This separates concerns and improves developer efficiency over managing schemas and databases separately.
- Examples demonstrate MetaQL queries in graph, path, select, and aggregation formats across SQL, NoSQL, and RDF implementations.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have more recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed: a lot of manual "data wrangling" is still needed to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in the data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. It covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
Optimizing the Data Supply Chain for Data Science – Vital.AI
As we move from the Data Warehouse to the Data Supply Chain, we open our perspective to include the full life cycle of data, from raw material to data product.
To produce data products with the most value, in an efficient and cost-effective manner, quality control processes must be put in place at each link in the chain, driven by the requirements of data scientists. With such processes in place, the burden on data scientists to cleanse data – typically 80% of their effort – can be greatly reduced.
Data Models – including schema, metadata, rules, and provenance – play a crucial role in ensuring an effective Data Supply Chain.
Each Data Supply Chain link must be defined with firm boundaries and clear lines of team responsibility, with Data Models providing the natural borders.
In this talk we will discuss the processes that must be put into place at each link in the Data Supply Chain including perspectives on:
* The definition of Data Supply Chain vs. Data Warehouse
* Tools to create, manage, utilize, and share Data Models
* Tracking Data Provenance
* ETL processes, driven by Data Models
* Collaborative processes across Data Science teams
* Visualization of Data and Data Flow across the Data Supply Chain
* Apache Hadoop and Apache Spark as enabling technologies
* Data Science
* Cross-Organizational Collaboration
* Security
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment – Paris Sud University
Today, we are experiencing an unprecedented production of resources, published as Linked Open Data (LOD, for short). This is leading to the creation of knowledge graphs (KGs) containing billions of RDF (Resource Description Framework) triples, such as DBpedia, YAGO and Wikidata on the academic side, and the Google Knowledge Graph or Microsoft’s Satori graph on the commercial side. These KGs contain millions of entities (such as people, proteins, or books), and millions of facts about them. This knowledge is typically expressed as RDF triples of the form ⟨Macron, presidentOf, France⟩. Some KGs provide an ontology expressed in OWL2 (Web Ontology Language), which describes the vocabulary (the classes and properties) for the RDF facts. However, to exploit and benefit from the richness of this available data and knowledge, several problems have to be faced, namely data linking, data fusion and knowledge discovery, when data is of big volume, heterogeneous and evolving. In this tutorial we will first give an overview of existing data linking and key discovery approaches. Then, we will discuss the problem of identity crisis caused by misuse of the owl:sameAs predicate and give some possible solutions. We will finish by highlighting some current challenges in this research area.
Semantics for Big Data Integration and Analysis – Craig Knoblock
Much of the focus on big data has been on the problem of processing very large sources. There is an equally hard problem of how to normalize, integrate, and transform the data from many sources into the format required to run large-scale analysis and visualization tools. We have previously developed an approach to semi-automatically mapping diverse sources into a shared domain ontology so that they can be quickly combined. In this paper we describe our approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets.
Data Enthusiasts London: Scalable and Interoperable data services. Applied to... – Andy Petrella
Data science requires many skills, many people, and a lot of time before results can be accessed. Moreover, those results can no longer be static. And with Big Data in the picture, the whole tool chain needs to change.
In this talk Data Fellas introduces Shar3, a toolkit aiming to bridge the gaps in building an interactive distributed data processing pipeline, or loop!
The talk then covers current problems in genomics, including data types, processing, and discovery, by introducing the GA4GH initiative and its implementation using Shar3.
The Apache Solr Semantic Knowledge Graph – Trey Grainger
What if instead of a query returning documents, you could alternatively return other keywords most related to the query: i.e. given a search for "data science", return me back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields) allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and will walk you through how to get up and running with an example dataset to explore the meaningful relationships hidden within your data.
How Graph Databases efficiently store, manage and query connected data at s... – jexp
Graph Databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query and consistency approaches that are used behind the scenes. We’ll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing and more, and see which papers and research database developers take inspiration from.
Geophy CTO Sander Mulders presented their Metadata platform at our March meetup at Skills Matter's CodeNode. The talk was about how Geophy use Linked Data approaches to accelerate and improve the accuracy of real estate tasks such as valuations.
Sander talked about the thousands of data sources used, how they use RDF for data integration, and how to construct features and metadata-driven services using components such as Apache Kafka and Stardog.
Family tree of data – provenance and neo4j – M. David Allen
Discusses data provenance and how it can be implemented in neo4j, as well as many lessons learned about the relative strengths and weaknesses of relational and graph databases.
Applied Machine learning using H2O, python and R Workshop – Avkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Prerequisites: Basic knowledge of R/Python and general ML concepts
Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate.
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding a linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
An Introduction to Graph: Database, Analytics, and Cloud Services – Jean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser... – Andy Petrella
Distributed Data Science…
* A genomics use case
* Spark Notebook
* Interactive Distributed Data Science
Distributed Data Science… Pipeline
* Pipeline: productizing Data Science
* Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
* Why Micro Services?
* Painful points:
* Data science is Discontiguous
* Context Lost in Translation
* Solution: Data Fellas’ Agile Data Science Toolkit
Large Scale Graph Analytics with RDF and LPG Parallel Processing – Cambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
Forces and Threats in a Data Warehouse (and why metadata and architecture is ... – Stefan Urbanek
This keynote looks at some very common forces and threats that cause suffering in a data warehouse, shows examples of why the concepts are still relevant despite having all the high-end technology, and provides suggestions for starting with architecture and metadata.
Leveraging mesos as the ultimate distributed data science platform – Andy Petrella
Keynote at the first @MesosCon #Europe on what Data Science is, what the new challenges and needs are, and how we address them at Data Fellas with the Spark Notebook and Shar3.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes – MongoDB
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect, Matt Kalan, sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
PoolParty Semantic Suite is Semantic Web Company’s platform for enterprise information integration based on Linked Data principles. PoolParty consists of several components that process and manage RDF based data sets. These components have consistency requirements towards the data they work on.
Also, users have requirements towards the quality of the data they manage. We want to express constraints for both in a standard way throughout PoolParty components. SKOS-based PoolParty Thesaurus project data requires both consistency and quality.
Introduction to Designing and Building Big Data Applications – Cloudera, Inc.
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
QuerySurge Slide Deck for Big Data Testing Webinar – RTTS
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why Testing is pivotal to the success of your Big Data Strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
Building and deploying LLM applications with Apache Airflow – Kaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integration and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Graph Databases in the Microsoft Ecosystem – Marco Parenzan
With SQL Server and Cosmos DB we now have graph databases broadly available, after decades of study in database theory and a niche open-source presence with Neo4j. And then there are services like Microsoft Graph and Azure Digital Twins that give us vertical implementations of graphs. So let's take a walk around graphs in the Microsoft ecosystem.
"Oslo" is the codename for Microsoft's forthcoming modeling platform. Modeling is used across a wide range of domains; it allows more people to participate in application design and lets developers write applications at a much higher level of abstraction.
MongoDB Evenings Toronto - Monolithic to Microservices with MongoDB – MongoDB
Monolithic to Microservices with MongoDB: Building Highly Available Services
Shawn McCarthy, Senior Solutions Architect, MongoDB
MongoDB Evenings Toronto
Infusion Offices
September 27, 2016
Tutorial Workgroup - Model versioning and collaboration – PascalDesmarets1
Hackolade Studio has native integration with Git repositories to provide state-of-the-art collaboration, versioning, branching, conflict resolution, peer review workflows, change tracking and traceability. Most importantly, it allows co-locating data models and schemas with application code, and further integrating with DevOps CI/CD pipelines as part of our vision for Metadata-as-Code.
Co-located application code and data models provide the single source-of-truth for business and technical stakeholders.
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading – Paco Nathan
Presentation to the Boulder/Denver BigData meetup 2013-09-25 http://www.meetup.com/Boulder-Denver-Big-Data/events/131047972/
Overview of Enterprise Data Workflows with Cascading; code samples in Cascading, Cascalog, Scalding; Lingual and Pattern Examples; An Evolution of Cluster Computing based on Apache Mesos, with use cases
Big Data Advanced Analytics on Microsoft Azure – Mark Tabladillo
This presentation provides a survey of the advanced analytics strengths of Microsoft Azure from an enterprise perspective (with these organizations being the bulk of big data users) based on the Team Data Science Process. The talk also covers the range of analytics and advanced analytics solutions available for developers using data science and artificial intelligence from Microsoft Azure.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas – Data Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
4. MetaQL
Leverage Domain Model (Schema)
Compose Queries in Code: Typed
Execute Queries on Databases, Interchangeably
Minimize TCO: Separation of Concerns, Developer Efficiency
Query Framework: Executable JVM Code! (Groovy Closure)
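To make this concrete, here is a minimal sketch of the idea using the GRAPH/segment constructs shown later in this deck. The Person class, the endpoint profiles, and the query() method name are illustrative assumptions, not part of the original slides.

// Illustrative sketch only: Person, the endpoint profiles, and query() are assumptions.
def personQuery = {
    GRAPH {
        value segments: [VitalSegment.withId('app-data')]
        ARC {
            node_constraint { Person.props().name.contains_i("lennon") }
        }
    }
}

// The same typed Groovy closure can be handed to interchangeable backends:
def sqlService   = VitalService.getService(profile: "sql-endpoint")     // hypothetical profile
def sparkService = VitalService.getService(profile: "spark-endpoint")   // hypothetical profile

[sqlService, sparkService].each { service ->
    service.query(personQuery).each { println it }   // MetaQL passed as an input closure
}

The point is that the query is composed once, against the domain model, and the choice of backing database becomes a deployment concern rather than a code change.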
5. MetaQL Origin
Across many data-driven application implementations, a desire for:
Reusable Processes, Tools: Stop re-inventing the wheel.
Tools to manage “schema” across an application & organization.
Tools to combine Semantic Web, NOSQL, and Hadoop/Spark.
Team Collaboration: Human labor is usually the limiting factor.
12. Internet of Things: Batch and Stream Processing
Components (from the slide diagram): Amazon Echo, Amazon Echo Service, haley-app webservice (Vert.X), Vital Prime, Database, DataScript, Hadoop - HDFS, Apache Spark (Streaming, MLLIB, NLP, GraphX), Aspen Datawarehouse, Analytics Layer, Serving Layer, Haley Device (Raspberry Pi), Voice to Text API, External APIs…
Cognitive Application: NLP and Inference to process User request; Query Knowledge in DB.
Streaming Prediction Models: “Should I really have more Coffee?”
28. volume, velocity, variety
polyglot persistence = multiple database technologies
…but we also have very many data models.
many databases, many data models, changing rapidly.
too many moving parts for a developer to reasonably manage!
need fewer APIs to learn!
29. what happens when changes occur?
Roles (each with their own tasks affected by change): Infrastructure, DevOps, Data Scientists, Business + Domain Experts, Developers
30. what changes?
Data Model Changes
New Data Sources
Infrastructure Change
Switch Databases
New Prediction Models / Features
New Service APIs…
Many Interdependencies…
Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.
31. total cost of ownership
How much code changes when we modify our data model to include new sources?
How do we minimize it by decoupling dependencies?
What happens when we switch database technologies?
32. Domain Model as “Contract”
Infrastructure, DevOps, Data Scientists, Business + Domain Experts, Developers – all sharing the Domain Model.
Everyone to agree on (or at least be aware of) the definition of Domain Concepts.
Use semantics to map “views”.
34. Infrastructure / DevOps
Database Types:
• Key/Value
• Document
• RDF Graph
• NOSQL
• Relational
• Timeseries
ACID vs. BASE
Optimizing Query Generation
Tuning Secondary Indices
Update MetaQL DSL for new DB features
CAP Theorem
36. Domain Model Implementation
Combine: SQL-style Schema with Hadoop Data Serialization Schema (Avro, Thrift, Protocol Buffers, Kryo, Parquet), and add Semantics: the “Meaning” of objects.
Not a table “person”, but define the concept of Person to be used throughout an application.
The implementation decides how to store “Person” data in its database.
37. Domain Model Implementation
Domain Model definition resolves:
RDF vs Property Graph model
Object Relational Impedance Mismatch
Use OWL to capture the Domain Model:
SubClasses
SubProperties
Multiple Inheritance
Marginal technology performance gains are hugely outweighed by human productivity gains, and a wider choice of tools.
Compromise across modeling paradigms.
38. Domain Model Implementation
Example: Healthcare Application:
URI<Person123> IS_A:
• Patient
• BillableAccount
• InsuredEntity
Same URI across three domain concepts: Diagnostics Records, Billing System, Insurance System.
Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java 8). The trait is used as a semantic marker to link to the Domain Model.
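As a rough illustration of the trait approach (plain Groovy, not the actual generated VitalSigns code), the pattern looks like this:

// Sketch only: plain Groovy traits standing in for generated domain-model markers.
trait Patient         { String patientId }
trait BillableAccount { BigDecimal balance }
trait InsuredEntity   { String policyNumber }

// One entity, one URI, playing three domain roles across the
// diagnostics, billing, and insurance systems.
class PersonRecord implements Patient, BillableAccount, InsuredEntity {
    String uri
}

def p = new PersonRecord(uri: "URI<Person123>")
p.patientId    = "pat-42"
p.balance      = 150.00
p.policyNumber = "POL-7"

assert p instanceof Patient && p instanceof BillableAccount && p instanceof InsuredEntity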
39. Domain Model - Core Classes
Core classes: Node, Edge, HyperNode, HyperEdge
Properties: URI, Primary Type, Types
Edges/HyperEdges: Source URI, Destination URI
Edge kinds: Peer, Taxonomy
Class Instances contain Properties.
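A bare-bones Groovy sketch of that core-class shape (the real VitalSigns classes carry much richer property metadata; the field and class names here simply mirror the list above):

// Sketch only: illustrative shape of the core classes, not the VitalSigns implementation.
abstract class GraphObject {
    String uri
    String primaryType          // most specific type
    List<String> types = []     // all types, including superclasses
    Map properties = [:]        // class instances contain Properties
}

class Node      extends GraphObject { }
class HyperNode extends GraphObject { }

class Edge extends GraphObject {
    String sourceURI            // edges connect by URI, not by object reference
    String destinationURI
}

class HyperEdge extends Edge { }

def john = new Node(uri: "urn:musician:john", primaryType: "Musician")
def band = new Node(uri: "urn:group:thebeatles", primaryType: "MusicGroup")
def hasMember = new Edge(uri: "urn:edge:1",
        sourceURI: band.uri, destinationURI: john.uri, primaryType: "Edge_hasMember")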
41. VitalSigns: Domain Model Dev Kit
$ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl
$ ls domain-groovy-jar
enron-dataset-groovy-1.0.0.jar
$ ls domain-json-schema
enron-dataset-1.0.0.js
OWL can be compiled into JVM code statically (create an artifact for Maven), or done dynamically at runtime.
43. Development with the Domain Model
VitalSigns vs = VitalSigns.get()
Musician john = new Musician().generateURI("john")
john.name = "John Lennon"
john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")
MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
thebeatles.name = "The Beatles"
// try to assign the wrong property, throws an exception
try { thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
} catch(Exception ex) { println ex } // no such property exception
vs.addToCache( thebeatles.addEdge_hasMember(john) )
// use cache to resolve queries
thebeatles.getMembers().each{ println it.name }
// use database to resolve queries
thebeatles.getMembers(ServiceWide).each{ println it.name }
Implicit MetaQL Queries
44. VitalService API
• Open/Close Endpoint
• Create/Remove Segment
• Create/Read/Update/Delete Object
• Queries (MetaQL as input closure)
• Service Operations (MetaQL as input closure)
• callFunction (DataScript)
• init Transaction/Commit/Rollback
A “Segment” is a Database (container of objects)
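A hedged sketch of that API surface in use; method names such as createSegment, save, get, delete, initTransaction and commit are assumptions based on the bullet list above, not verified signatures, and Customer is an illustrative domain class:

// Sketch only: method names and the Customer class are assumptions.
def service = VitalService.getService(profile: "app-db")        // open an endpoint

def segment = VitalSegment.withId("customers")                   // a Segment is a database (container of objects)
service.createSegment(segment)                                   // assumed name for "Create Segment"

def customer = new Customer().generateURI("cust-1001")           // illustrative domain class
customer.name = "Acme Corp"

def tx = service.initTransaction()                                // assumed names for transaction handling
service.save(segment, customer)                                   // Create/Update object
service.commit(tx)

def fetched = service.get(URIProperty.withString(customer.URI))   // Read object by URI (assumed call)
service.delete(fetched)                                            // Delete object (assumed call)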
45. MetaQL
VitalSigns: Domain Model Manager
• MetaQL DSL
• Prediction Model DSL
• Pipeline Transformation DSL (ETL)
(in development)
A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy). It’s an ongoing effort.
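To make the host-language point concrete, here is a tiny self-contained Groovy builder (a generic illustration, not the real MetaQL classes) showing the closure-delegation pattern that nested blocks like GRAPH { ARC { ... } } rely on:

// Minimal generic example of a nested-closure DSL in Groovy.
class ArcBuilder {
    Map spec = [constraints: [], arcs: []]
    void node_constraint(Closure c) { spec.constraints << c }
    void ARC(Closure body) {
        def child = new ArcBuilder()
        body.delegate = child
        body.resolveStrategy = Closure.DELEGATE_FIRST
        body()
        spec.arcs << child.spec
    }
}

Map GRAPH(Closure body) {
    def root = new ArcBuilder()
    body.delegate = root
    body.resolveStrategy = Closure.DELEGATE_FIRST
    body()
    return root.spec
}

def querySpec = GRAPH {
    ARC {
        node_constraint { "name contains happy" }      // placeholder constraint
        ARC { node_constraint { "nested arc" } }       // nested arc
    }
}
println querySpec.arcs.size()   // 1 top-level arc, with one nested arc inside it

The delegate-first resolution is what lets method calls inside each block land on the right builder, which is why an entire query can be an ordinary, executable Groovy closure.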
50. GRAPH query (2)
GRAPH {
value segments: [VitalSegment.withId('wordnet')]
value inlineObjects: true
ARC {
node_bind { "node1" }
node_constraint { SynsetNode.expandSubclasses(true) }
node_constraint { SynsetNode.props().name.contains_i("happy") }
ARC {
edge_bind { "edge" }
node_bind { "node2" }
}
}
}
Code iterating over Results can use bind names to reference objects in each solution: node1, edge, node2.
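One hedged sketch of consuming those bound results; the service/query method names and the solution accessor are assumptions, while the bind names node1, edge and node2 come from the query above:

// Sketch only: query() and the solution accessor are assumed; bind names are from the GRAPH query above.
def service = VitalService.getService(profile: "wordnet-db")    // hypothetical profile

service.query(graphQuery).each { solution ->                     // graphQuery = the GRAPH closure above
    def synset  = solution."node1"     // node matched by the SynsetNode constraints
    def edge    = solution."edge"      // edge bound in the inner ARC
    def related = solution."node2"     // node on the far side of that edge
    println "${synset.name} -> ${related.name} via ${edge.primaryType}"
}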
51. PATH query
def forward = true
def reverse = false
PATH {
value segments: segments
value maxdepth: 5
value rootURIs: [URIProperty.withString(inputURI)]
if( forward ) {
ARC {
value direction: 'forward'
// accept any edge: edge_constraint { }
// accept any node: node_constraint { }
}
}
if( reverse ) {
ARC {
value direction: 'reverse'
// accept any edge: edge_constraint { }
// accept any node: node_constraint { }
}
}
}
52. AGGREGATION query
SUM Product.props().cost
AVERAGE Person.props().birthday
COUNT_DISTINCT Document.props().active
FIRST { DISTINCT Document.props().title, expandProperty : false, order: Order.ASC }
Part of a SELECT query
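By analogy with the GRAPH and PATH blocks, a full SELECT query carrying one of these aggregates might look roughly like this; the SELECT block shape and the segment id are assumptions, not confirmed by the slides:

// Assumed shape only; the exact SELECT grammar may differ from this sketch.
def avgCost = SELECT {
    value segments: [VitalSegment.withId('catalog')]     // hypothetical segment id
    node_constraint { Product.expandSubclasses(true) }
    AVERAGE Product.props().cost                          // aggregate from the list above
}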
60. Spark-SQL / Dataframe
Segment RDD: (URI, P, V); Property RDD: (K, V)
Experimenting with: the new DataFrame optimizer (Catalyst), the new DataFrame DSL for query generation, and GraphX for isolated graph-query cases.
We can generate "bad" queries and let the optimizer fix them, with Spark partitioning the RDDs, as long as Spark is aware of the schema.
65. implementation
DSL Documentation to be posted:
http://www.metaql.org/
VitalSigns, VitalService, MetaQL
https://dashboard.vital.ai/
Vital AI github: https://github.com/vital-ai/
Sample Code
Spark Code: Aspen, Aspen-Datawarehouse
Documentation Coming!
66. closing thoughts
Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data.
“Domain Model as Contract” provides a framework for consistent interpretation of Data across an application.
MetaQL provides a framework for the consistent access and query of Data across an application.
Context: Data-Driven Applications / Cognitive Applications
67. Thank You!
Marc C. Hadfield, Founder
Vital AI
http://vital.ai
marc@vital.ai
917.463.4776
68. Pipeline DSL (ETL)
PIPELINE { // Workflow
PIPE { // a Workflow Component with dependencies
TRANSFORM { // Joins across Datasets
IF (RULE { } ) // Boolean, Query, Construct, …
THEN { RULE { } }
ELSE { RULE { } }
}
PIPE { … } // dependent PIPE
} // Output Dataset
PIPE { …
}
}
Influenced by Spark Pipeline and Google Dataflow Pipeline
70. Multiple Endpoints
def service1 = VitalService.getService(profile: "kv-users")
def service2 = VitalService.getService(profile: "posts-db")
def service3 = VitalService.getService(profile: "friendgraph-db")
// given user URI:user123@email.org
// get user object from service1
// find friends of user in friendgraph via service3
// find posts of friends in posts-db
// update service1 with cache of user-to-friends-postings
// send postings of friends to user in UI
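A hedged sketch of what those commented steps might look like in code; every method name (get, query, save), the Post domain class, the authorURI property, and the GRAPH/SELECT shapes are assumptions layered on the constructs shown earlier in the deck:

// Sketch only: all method names, constraints, and domain classes below are assumptions.
def userURI = URIProperty.withString("URI:user123@email.org")

// get user object from service1 (the kv-users store)
def user = service1.get(userURI)

// find friends of the user in the friend graph via service3
def friends = service3.query {
    GRAPH {
        value segments: [VitalSegment.withId('friendgraph')]
        ARC {
            node_bind { "friend" }
            // root at the user and follow friendship edges (constraints elided here)
        }
    }
}.collect { it."friend" }

// find posts of those friends in posts-db via service2
def postings = friends.collectMany { friend ->
    service2.query {
        SELECT {
            value segments: [VitalSegment.withId('posts')]
            node_constraint { Post.props().authorURI.equalTo(friend.URI) }
        }
    }
}

// update service1 with a cache of the user's friends' postings, then send to the UI
service1.save(VitalSegment.withId('kv-users'), postings)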