1. The document outlines the evolution of graph schemas from early semantic web schemas like RDFS and OWL to simpler property graph schemas.
2. It discusses elements of graph schemas including entity types, relationship types, indexes, and schema imports.
3. Graph and schema management techniques are covered including schema validation, initialization, migration, and revision control.
4. Graph generation techniques are presented for capacity planning and benchmarking graphs of different sizes based on schema statistics.
1. Evolution of the Graph Schema
Data Day Seattle 2017
Joshua Shinavier, PhD
20.10.2017
2. 1. Knowledge and graphs
2. Semantic Web to Property Graphs
3. Re-emergence of the graph schema
4. Elements of a schema language
5. Graph and schema management
6. Graph generation
Outline
4. • Performance is a factor, but
• Many storage back-ends can be adapted to graphs
• E.g. relational DBs, column stores, key-value stores
• Better reasons:
• The domain model is graph-like
• We can take inspiration from the way we naturally
understand the world
Why graph databases?
6. • How do we relate data with concepts in order to make
inferences or take action?
• We use schemata — rules that constrain data to a
language of “categories” (concepts)
• Some fundamental categories are built in
• E.g. plurality, necessity, limitation, negation,
reciprocity, etc.
• Others are built upon the foundation
Kant’s “schemata”
This schematism […] is an art, hidden in the depths of the human
soul, whose true modes of action we shall only with difficulty
discover and unveil. (Kant, 1781)
7. • Psychologists saw Kant’s “schemata” in the organization
of human memory (Head, 1920), but
• Memory is more than storage and recall
• We react to data by combining a schema with an
attitude (Bartlett, 1932)
Schemas in psychology
8. • Scripts, plans, goals (Schank & Abelson, 1970s)
• Frames (Minsky, 1974)
• Early KR languages
• Upper-level ontologies and commonsense
knowledge bases
Enter databases and AI
10. • A vocabulary for vocabulary sharing
• Includes a handful of basic terms
• Classes, properties, inheritance
• Meets the needs of most Web schemas
RDF Schema (RDFS)
11. • A much more expressive language for ontology development
• Supports:
• Classes and properties with inheritance
• Equality and identity (sameAs, differentFrom, equivalentClass/equivalentProperty)
• Property domain/range restrictions, cardinality restrictions
• Inverse, transitive, and symmetric properties
• Ontology metadata (imports, versioning)
• Sublanguages OWL Full, OWL DL (description logic), OWL Lite
• OWL 2 profiles EL (polynomial-time reasoning), QL (efficient query answering over relational data), RL (scalable rule-based reasoning)
• Is this slide too dense? OWL is huge.
OWL
12. • What commercial applications ended up using
• Supported by AllegroGraph, TopBraid, etc.
• All of RDFS
• Classes, properties, inheritance
• A few terms stolen from OWL
• e.g. sameAs, inverseOf, TransitiveProperty
“RDFS+”
14. • Property Graph data model takes a minimalist
approach
• Typically no inference or rules support
• Graph DBs, schema.org are a response to real-world
demands
…simplified
[Diagram: a three-vertex property graph — vertices 1, 2, and 3 connected by edges labeled foo, foo, and bar]
16. • There is power in simplicity
• NoSQL databases are said to have no predefined
schema
• In practice, every graph DB has a schema
• A set of constraints or assumptions about correct
structure
• Useful for validation and optimization
• There is no graph schema standard
NoSQL ⇏ no schema
17. • The Property Graph data model is itself a basic schema
• Edge labels (required)
• Vertex labels (optional)
• Property keys (required)
• Property data types (optional, with optional constraints)
• Vertex meta-properties (optional)
Schemas in TinkerPop
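The schema elements listed above can be made concrete with a toy in-memory model. This is a minimal illustrative sketch, not the TinkerPop API: it shows only that edge labels are required, vertex labels and properties ride along on the elements, and nothing more is enforced.

```java
import java.util.*;

// A toy property graph: labeled vertices and edges with string-keyed
// properties. Hypothetical stand-in types, not TinkerPop classes.
public class PropertyGraphSketch {
    static class Vertex {
        final String label;                              // optional in the model
        final Map<String, Object> properties = new HashMap<>();
        Vertex(String label) { this.label = label; }
    }
    static class Edge {
        final String label;                              // required by the model
        final Vertex out, in;
        final Map<String, Object> properties = new HashMap<>();
        Edge(String label, Vertex out, Vertex in) {
            this.label = label; this.out = out; this.in = in;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    Vertex addVertex(String label) {
        Vertex v = new Vertex(label); vertices.add(v); return v;
    }
    Edge addEdge(String label, Vertex out, Vertex in) {
        Edge e = new Edge(label, out, in); edges.add(e); return e;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex rider = g.addVertex("User");
        Vertex trip = g.addVertex("Trip");
        rider.properties.put("name", "Arachne");
        g.addEdge("requested", rider, trip);             // edge label is mandatory
        System.out.println(g.vertices.size() + " vertices, " + g.edges.size() + " edges");
    }
}
```

Note that nothing in this bare model validates labels or property types; that is exactly the gap a schema language fills.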
18. • Labels
• Simple types on nodes and/or relationships
• Indexes
• Single-property — equality, existence, containment,
ranges
• Composite (multiple properties) — equality only
• Constraints
• Node property uniqueness
• Node/relationship property existence
• Node key (set of properties unique for the node)
Schemas in Neo4j
20. • Object databases ≠ graph databases, but similar
• Built-in, object-oriented schemas
• Classes, extension, relationships, recursivity, etc.
• Used for encapsulation, composition, inheritance,
delegation, etc.
• OOP frameworks for graph DBs
• Frames, Ferma, etc.
Schemas in object databases
21. • Hypernode
• Objects, relations, and functions
• GROOVY
• Multi-level OOP schemas
• HyperGraphDB
• Types and relationships
• Grakn.AI
• Entities, relations, roles, and resources (data type,
uniqueness, regex)
• Single inheritance
Schemas in hypergraph databases
23. • Support for a basic schema vocabulary
• Entity and relationship types, constraints
• Good coverage of existing schema frameworks
• Extensibility of schemas and types
• Mappings to RDF, schema.org, and storage frameworks
• Reference APIs for
• Schema validation
• Graph schema initialization and migration
• Statistical models, graph generation
Design goals
24. • Things about which we can make assertions
• “Classes” in RDF, “types” in schema.org, “vertex labels”
in TinkerPop, etc.
• Extend other entity types
Entity types
entities:
- label: Trip
sameAs: http://schema.org/TravelAction
description: A trip taken by a driver or requested by a rider
25. • Assertions about things
• “Properties” in RDF and schema.org
• “Edges” vs. “properties” in graph databases
• Hyperedges, meta-properties are also “relations”
Relationship types
relations:
- label: requested
description: Relates a rider to a trip he or she has requested
extends:
- core.relatedTo
cardinality: OneToMany
from: users.User
to: Trip
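A declaration like the one above can drive validation. The sketch below, with hypothetical stand-in types rather than a reference API, checks candidate edges against a relation type's domain, range, and OneToMany cardinality (each target has at most one source):

```java
import java.util.*;

// Validate edges against a declared relationship type (label, from, to,
// cardinality). Illustrative only; shapes are assumptions, not a real API.
public class RelationValidator {
    record RelationType(String label, String from, String to, boolean oneToMany) {}
    record Edge(String label, String fromType, String toType,
                String fromId, String toId) {}

    static List<String> validate(RelationType t, List<Edge> edges) {
        List<String> errors = new ArrayList<>();
        Map<String, String> sourceOf = new HashMap<>();      // toId -> fromId
        for (Edge e : edges) {
            if (!e.label().equals(t.label())) continue;      // not governed by this type
            if (!e.fromType().equals(t.from()))
                errors.add("bad domain: " + e.fromType() + " (expected " + t.from() + ")");
            if (!e.toType().equals(t.to()))
                errors.add("bad range: " + e.toType() + " (expected " + t.to() + ")");
            if (t.oneToMany()) {                             // at most one source per target
                String prev = sourceOf.putIfAbsent(e.toId(), e.fromId());
                if (prev != null && !prev.equals(e.fromId()))
                    errors.add("cardinality violation on " + e.toId());
            }
        }
        return errors;
    }
}
```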
26. • Graph-centric
• Single-relation, composite
• Entity-centric
• Ordering on a secondary key
Index hints
indexes:
- key: core.uuid
- key: trips.requested
direction: Out
orderBy: core.createdAt
order: Decreasing
27. • Schemas import other schemas, like software modules
• Give developers/teams autonomy, but
• Coordinate schema integration top-down
Schema imports
name: production
version: 1.2
includes:
- name: trips
version: 1.2
- name: referrals
version: 1.2
29. • Study the source data
• Extend and validate the shared schema
• Generate artificial graph data
• Study system performance, iterate on the model
• Develop ingestion mappings for real data
• Review and check in schema changes
• Apply the schema to a live database
• Ingest data into the live database
Graph onboarding workflow
31. • The schema is constantly changing
• Is this database compatible with this schema?
• How to update the database w.r.t. the schema?
• Use revision control to find diffs
• Ordered lists of basic changes
• Translate diffs to storage-specific workflows
• Ordered lists of idempotent operations
• Apply diff workflows to the database
Schema initialization, migration
public enum SchemaChange {
AbstractAttributeChanged,
CardinalityChanged,
DomainChanged,
EntityAdded,
EntityRemoved,
ExtensionAdded,
ExtensionRemoved,
IncludeAdded,
IncludeRemoved,
IndexAdded,
IndexRemoved,
RangeChanged,
RelationAdded,
RelationRemoved,
RequiredAttributeChanged,
RequiredOfAttributeChanged,
SchemaAdded,
SchemaRemoved,
SchemaNameChanged,
SchemaVersionChanged,
}
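The "ordered lists of idempotent operations" idea can be sketched as follows. The MockDatabase and the handful of change kinds handled here are illustrative stand-ins for a storage-specific workflow; the point is that re-running a partially applied migration is safe because each operation is a no-op once the database is already in the target state.

```java
import java.util.*;

// Translate a schema diff (an ordered list of changes) into idempotent
// operations against a mock database. Hypothetical sketch, not a reference
// implementation.
public class MigrationSketch {
    static class MockDatabase {
        final Set<String> entityLabels = new HashSet<>();
        final Set<String> indexedKeys = new HashSet<>();
    }
    record Change(String kind, String target) {}     // e.g. ("EntityAdded", "Trip")

    static void apply(MockDatabase db, List<Change> diff) {
        for (Change c : diff) {
            switch (c.kind()) {
                // Set.add / Set.remove are no-ops when already in the target
                // state, which is what makes re-application safe.
                case "EntityAdded"   -> db.entityLabels.add(c.target());
                case "EntityRemoved" -> db.entityLabels.remove(c.target());
                case "IndexAdded"    -> db.indexedKeys.add(c.target());
                case "IndexRemoved"  -> db.indexedKeys.remove(c.target());
                default -> throw new IllegalArgumentException(
                    "unsupported change: " + c.kind());
            }
        }
    }
}
```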
32. Schema diff and patch
[Diagram: a new database is initialized at Schema x.1; finding the diff of Schema x.1 and Schema x.2, then applying that diff, migrates the database from Schema x.1 to Schema x.2]
33. Migration is not always possible
Don’t feel bad!
Basic schemas can’t be changed!
• E.g.
• Removal or abstraction of types already in use
• Changes unsupported at the storage level
35. • Problem:
• Need to predict write throughput, read latency
given 10x more data
• Analytical solutions are difficult
• Solution?
• Generate graphs of different sizes
• Study the trends
• Problem:
• Where do we get the data?
• Shrinking or growing real data is difficult
Capacity planning
36. • Existing graph benchmarks
• Lancichinetti-Fortunato-Radicchi (LFR) benchmark
• graphdb-benchmarks
• Linked Data Benchmark Council (LDBC)
• SPARQL benchmarks for triple stores
• None of these resembles our data very closely
• Ours is not a social network; no power-law distributions
• Vastly different topology
• Idea: use the schema to generate statistically
representative data
Benchmarking options
37. • Gather some statistics
• Entity and relationship type distributions
• Per-relationship in- and out-degree distributions
• Add these to the schema
• Give the Graphgen utility a dataset size, random seed
• Graphgen attempts to create a graph in accordance
with the model
• Gather statistics from the generated graph
• Compare and contrast
• Same dataset can be generated in different
environments
Graph generation workflow
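The degree-distribution-driven step of this workflow can be sketched as follows. The distribution, sizes, and method names are illustrative assumptions (not the actual Graphgen utility): out-degrees are drawn from a discrete histogram with a seeded RNG, so the same dataset can be regenerated in any environment.

```java
import java.util.*;

// Generate "requested" edges (user index -> trip index) by sampling each
// user's out-degree from a discrete distribution. Fixed seed => the same
// graph every run. Hypothetical sketch, not the Graphgen utility itself.
public class GraphGenSketch {
    // Sample a degree: degrees[i] is chosen with probability weights[i].
    static int sampleDegree(int[] degrees, double[] weights, Random rng) {
        double r = rng.nextDouble(), cum = 0;
        for (int i = 0; i < degrees.length; i++) {
            cum += weights[i];
            if (r < cum) return degrees[i];
        }
        return degrees[degrees.length - 1];
    }

    static List<int[]> generate(int nUsers, int nTrips, long seed) {
        Random rng = new Random(seed);                 // seeded for reproducibility
        int[] degrees = {0, 1, 2, 5};                  // illustrative histogram
        double[] weights = {0.2, 0.5, 0.2, 0.1};       // must sum to 1
        List<int[]> edges = new ArrayList<>();
        for (int u = 0; u < nUsers; u++) {
            int d = sampleDegree(degrees, weights, rng);
            for (int k = 0; k < d; k++)
                edges.add(new int[]{u, rng.nextInt(nTrips)});
        }
        return edges;
    }

    public static void main(String[] args) {
        List<int[]> edges = generate(1000, 5000, 42L);
        System.out.println("generated " + edges.size() + " edges");
    }
}
```

Comparing the empirical degree distribution of the output against the histogram in the schema closes the loop described above.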