1. The document outlines the evolution of graph schemas from early semantic web schemas like RDFS and OWL to simpler property graph schemas.
2. It discusses elements of graph schemas including entity types, relationship types, indexes, and schema imports.
3. Graph and schema management techniques are covered including schema validation, initialization, migration, and revision control.
4. Graph generation techniques are presented for capacity planning and benchmarking graphs of different sizes based on schema statistics.
1. Evolution of the Graph Schema
Data Day Seattle 2017
Joshua Shinavier, PhD
20.10.2017
2. 1. Knowledge and graphs
2. Semantic Web to Property Graphs
3. Re-emergence of the graph schema
4. Elements of a schema language
5. Graph and schema management
6. Graph generation
Outline
4. • Performance is a factor, but
• Many storage back-ends can be adapted to graphs
• E.g. relational DBs, column stores, key-value stores
• Better reasons:
• The domain model is graph-like
• We can take inspiration from the way we naturally
understand the world
Why graph databases?
6. • How do we relate data with concepts in order to make
inferences or take action?
• We use schemata — rules that constrain data to a
language of “categories” (concepts)
• Some fundamental categories are built in
• E.g. plurality, necessity, limitation, negation,
reciprocity, etc.
• Others are built upon the foundation
Kant’s “schemata”
This schematism […] is an art, hidden in the depths of the human
soul, whose true modes of action we shall only with difficulty
discover and unveil. (Kant, 1781)
7. • Psychologists saw Kant’s “schemata” in the organization
of human memory (Head, 1920), but
• Memory is more than storage and recall
• We react to data by combining a schema with an
attitude (Bartlett, 1932)
Schemas in psychology
8. • Scripts, plans, goals (Schank & Abelson, 1970s)
• Frames (Minsky, 1974)
• Early KR languages
• Upper-level ontologies and commonsense
knowledge bases
Enter databases and AI
10. • A vocabulary for vocabulary sharing
• Includes a handful of basic terms
• Classes, properties, inheritance
• Meets the needs of most Web schemas
RDF Schema (RDFS)
11. • A much more expressive language for ontology development
• Supports:
• Classes and properties with inheritance
• Equality and identity (sameAs, differentFrom, equivalentClass/equivalentProperty)
• Property domain/range restrictions, cardinality restrictions
• Inverse, transitive, and symmetric properties
• Ontology metadata (imports, versioning)
• Sublanguages OWL Full, OWL DL (description logic), OWL Lite
• OWL 2 profiles EL (polynomial-time reasoning), QL (efficient query answering over relational data), RL (scalable rule-based reasoning)
• Is this slide too dense? OWL is huge.
OWL
12. • What commercial applications ended up using
• Supported by AllegroGraph, TopBraid, etc.
• All of RDFS
• Classes, properties, inheritance
• A few terms stolen from OWL
• e.g. sameAs, inverseOf, TransitiveProperty
“RDFS+”
14. • Property Graph data model takes a minimalist
approach
• Typically no inference or rules support
• Graph DBs, schema.org are a response to real-world
demands
…simplified
[Diagram: a three-vertex property graph — vertices 1, 2, and 3 connected by edges labeled foo, foo, and bar]
16. • There is power in simplicity
• NoSQL databases are said to have no predefined
schema
• In practice, every graph DB has a schema
• A set of constraints or assumptions about correct
structure
• Useful for validation and optimization
• There is no graph schema standard
NoSQL ⇏ no schema
17. • The Property Graph data model is itself a basic schema
• Edge labels (required)
• Vertex labels (optional)
• Property keys (required)
• Property data types (optional, with optional constraints)
• Vertex meta-properties (optional)
Schemas in TinkerPop
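The schema elements listed above can be made concrete with a toy in-memory model. This is a minimal illustrative sketch, not the TinkerPop API: it shows only that edge labels are required, vertex labels and properties ride along on the elements, and nothing more is enforced.

```java
import java.util.*;

// A toy property graph: labeled vertices and edges with string-keyed
// properties. Hypothetical stand-in types, not TinkerPop classes.
public class PropertyGraphSketch {
    static class Vertex {
        final String label;                              // optional in the model
        final Map<String, Object> properties = new HashMap<>();
        Vertex(String label) { this.label = label; }
    }
    static class Edge {
        final String label;                              // required by the model
        final Vertex out, in;
        final Map<String, Object> properties = new HashMap<>();
        Edge(String label, Vertex out, Vertex in) {
            this.label = label; this.out = out; this.in = in;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    Vertex addVertex(String label) {
        Vertex v = new Vertex(label); vertices.add(v); return v;
    }
    Edge addEdge(String label, Vertex out, Vertex in) {
        Edge e = new Edge(label, out, in); edges.add(e); return e;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex rider = g.addVertex("User");
        Vertex trip = g.addVertex("Trip");
        rider.properties.put("name", "Arachne");
        g.addEdge("requested", rider, trip);             // edge label is mandatory
        System.out.println(g.vertices.size() + " vertices, " + g.edges.size() + " edges");
    }
}
```

Note that nothing in this bare model validates labels or property types; that is exactly the gap a schema language fills.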
18. • Labels
• Simple types on nodes and/or relationships
• Indexes
• Single-property — equality, existence, containment,
ranges
• Composite (multiple properties) — equality only
• Constraints
• Node property uniqueness
• Node/relationship property existence
• Node key (set of properties unique for the node)
Schemas in Neo4j
20. • Object databases ≠ graph databases, but similar
• Built-in, object-oriented schemas
• Classes, extension, relationships, recursivity, etc.
• Used for encapsulation, composition, inheritance,
delegation, etc.
• OOP frameworks for graph DBs
• Frames, Ferma, etc.
Schemas in object databases
21. • Hypernode
• Objects, relations, and functions
• GROOVY
• Multi-level OOP schemas
• HyperGraphDB
• Types and relationships
• Grakn.AI
• Entities, relations, roles, and resources (data type,
uniqueness, regex)
• Single inheritance
Schemas in hypergraph databases
23. • Support for a basic schema vocabulary
• Entity and relationship types, constraints
• Good coverage of existing schema frameworks
• Extensibility of schemas and types
• Mappings to RDF, schema.org, and storage frameworks
• Reference APIs for
• Schema validation
• Graph schema initialization and migration
• Statistical models, graph generation
Design goals
24. • Things about which we can make assertions
• “Classes” in RDF, “types” in schema.org, “vertex labels”
in TinkerPop, etc.
• Extend other entity types
Entity types
entities:
- label: Trip
sameAs: http://schema.org/TravelAction
description: A trip taken by a driver or requested by a rider
25. • Assertions about things
• “Properties” in RDF and schema.org
• “Edges” vs. “properties” in graph databases
• Hyperedges, meta-properties are also “relations”
Relationship types
relations:
- label: requested
description: Relates a rider to a trip he or she has requested
extends:
- core.relatedTo
cardinality: OneToMany
from: users.User
to: Trip
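A declaration like the one above can drive validation. The sketch below, with hypothetical stand-in types rather than a reference API, checks candidate edges against a relation type's domain, range, and OneToMany cardinality (each target has at most one source):

```java
import java.util.*;

// Validate edges against a declared relationship type (label, from, to,
// cardinality). Illustrative only; shapes are assumptions, not a real API.
public class RelationValidator {
    record RelationType(String label, String from, String to, boolean oneToMany) {}
    record Edge(String label, String fromType, String toType,
                String fromId, String toId) {}

    static List<String> validate(RelationType t, List<Edge> edges) {
        List<String> errors = new ArrayList<>();
        Map<String, String> sourceOf = new HashMap<>();      // toId -> fromId
        for (Edge e : edges) {
            if (!e.label().equals(t.label())) continue;      // not governed by this type
            if (!e.fromType().equals(t.from()))
                errors.add("bad domain: " + e.fromType() + " (expected " + t.from() + ")");
            if (!e.toType().equals(t.to()))
                errors.add("bad range: " + e.toType() + " (expected " + t.to() + ")");
            if (t.oneToMany()) {                             // at most one source per target
                String prev = sourceOf.putIfAbsent(e.toId(), e.fromId());
                if (prev != null && !prev.equals(e.fromId()))
                    errors.add("cardinality violation on " + e.toId());
            }
        }
        return errors;
    }
}
```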
26. • Graph-centric
• Single-relation, composite
• Entity-centric
• Ordering on a secondary key
Index hints
indexes:
- key: core.uuid
- key: trips.requested
direction: Out
orderBy: core.createdAt
order: Decreasing
27. • Schemas import other schemas, like software modules
• Give developers/teams autonomy, but
• Coordinate schema integration top-down
Schema imports
name: production
version: 1.2
includes:
- name: trips
version: 1.2
- name: referrals
version: 1.2
29. • Study the source data
• Extend and validate the shared schema
• Generate artificial graph data
• Study system performance, iterate on the model
• Develop ingestion mappings for real data
• Review and check in schema changes
• Apply the schema to a live database
• Ingest data into the live database
Graph onboarding workflow
31. • The schema is constantly changing
• Is this database compatible with this schema?
• How to update the database w.r.t. the schema?
• Use revision control to find diffs
• Ordered lists of basic changes
• Translate diffs to storage-specific workflows
• Ordered lists of idempotent operations
• Apply diff workflows to the database
Schema initialization, migration
public enum SchemaChange {
AbstractAttributeChanged,
CardinalityChanged,
DomainChanged,
EntityAdded,
EntityRemoved,
ExtensionAdded,
ExtensionRemoved,
IncludeAdded,
IncludeRemoved,
IndexAdded,
IndexRemoved,
RangeChanged,
RelationAdded,
RelationRemoved,
RequiredAttributeChanged,
RequiredOfAttributeChanged,
SchemaAdded,
SchemaRemoved,
SchemaNameChanged,
SchemaVersionChanged,
}
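The "ordered lists of idempotent operations" idea can be sketched as follows. The MockDatabase and the handful of change kinds handled here are illustrative stand-ins for a storage-specific workflow; the point is that re-running a partially applied migration is safe because each operation is a no-op once the database is already in the target state.

```java
import java.util.*;

// Translate a schema diff (an ordered list of changes) into idempotent
// operations against a mock database. Hypothetical sketch, not a reference
// implementation.
public class MigrationSketch {
    static class MockDatabase {
        final Set<String> entityLabels = new HashSet<>();
        final Set<String> indexedKeys = new HashSet<>();
    }
    record Change(String kind, String target) {}     // e.g. ("EntityAdded", "Trip")

    static void apply(MockDatabase db, List<Change> diff) {
        for (Change c : diff) {
            switch (c.kind()) {
                // Set.add / Set.remove are no-ops when already in the target
                // state, which is what makes re-application safe.
                case "EntityAdded"   -> db.entityLabels.add(c.target());
                case "EntityRemoved" -> db.entityLabels.remove(c.target());
                case "IndexAdded"    -> db.indexedKeys.add(c.target());
                case "IndexRemoved"  -> db.indexedKeys.remove(c.target());
                default -> throw new IllegalArgumentException(
                    "unsupported change: " + c.kind());
            }
        }
    }
}
```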
32. Schema diff and patch
[Diagram: a new database is initialized at Schema x.1; finding the diff of Schema x.1 and Schema x.2, then applying that diff, migrates the database from Schema x.1 to Schema x.2]
33. Migration is not always possible
Don’t feel bad!
Basic schemas can’t be changed!
• E.g.
• Removal or abstraction of types already in use
• Changes unsupported at the storage level
35. • Problem:
• Need to predict write throughput, read latency
given 10x more data
• Analytical solutions are difficult
• Solution?
• Generate graphs of different sizes
• Study the trends
• Problem:
• Where do we get the data?
• Shrinking or growing real data is difficult
Capacity planning
36. • Existing graph benchmarks
• Lancichinetti-Fortunato-Radicchi (LFR) benchmark
• graphdb-benchmarks
• Linked Data Benchmark Council (LDBC)
• SPARQL benchmarks for triple stores
• None of these resembles our data very closely
• Ours is not a social network; no power-law distributions
• Vastly different topology
• Idea: use the schema to generate statistically
representative data
Benchmarking options
37. • Gather some statistics
• Entity and relationship type distributions
• Per-relationship in- and out-degree distributions
• Add these to the schema
• Give the Graphgen utility a dataset size, random seed
• Graphgen attempts to create a graph in accordance
with the model
• Gather statistics from the generated graph
• Compare and contrast
• Same dataset can be generated in different
environments
Graph generation workflow
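The degree-distribution-driven step of this workflow can be sketched as follows. The distribution, sizes, and method names are illustrative assumptions (not the actual Graphgen utility): out-degrees are drawn from a discrete histogram with a seeded RNG, so the same dataset can be regenerated in any environment.

```java
import java.util.*;

// Generate "requested" edges (user index -> trip index) by sampling each
// user's out-degree from a discrete distribution. Fixed seed => the same
// graph every run. Hypothetical sketch, not the Graphgen utility itself.
public class GraphGenSketch {
    // Sample a degree: degrees[i] is chosen with probability weights[i].
    static int sampleDegree(int[] degrees, double[] weights, Random rng) {
        double r = rng.nextDouble(), cum = 0;
        for (int i = 0; i < degrees.length; i++) {
            cum += weights[i];
            if (r < cum) return degrees[i];
        }
        return degrees[degrees.length - 1];
    }

    static List<int[]> generate(int nUsers, int nTrips, long seed) {
        Random rng = new Random(seed);                 // seeded for reproducibility
        int[] degrees = {0, 1, 2, 5};                  // illustrative histogram
        double[] weights = {0.2, 0.5, 0.2, 0.1};       // must sum to 1
        List<int[]> edges = new ArrayList<>();
        for (int u = 0; u < nUsers; u++) {
            int d = sampleDegree(degrees, weights, rng);
            for (int k = 0; k < d; k++)
                edges.add(new int[]{u, rng.nextInt(nTrips)});
        }
        return edges;
    }

    public static void main(String[] args) {
        List<int[]> edges = generate(1000, 5000, 42L);
        System.out.println("generated " + edges.size() + " edges");
    }
}
```

Comparing the empirical degree distribution of the output against the histogram in the schema closes the loop described above.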