Evolution of the Graph Schema
Data Day Seattle 2017
Joshua Shinavier, PhD
20.10.2017
1. Knowledge and graphs
2. Semantic Web to Property Graphs
3. Re-emergence of the graph schema
4. Elements of a schema language
5. Graph and schema management
6. Graph generation
Outline
Knowledge and graphs
• Performance is a factor, but
  • Many storage back-ends can be adapted to graphs
    • E.g. relational DBs, column stores, key-value stores
• Better reasons:
  • The domain model is graph-like
  • We can take inspiration from the way we naturally understand the world
Why graph databases?
Early data modelers
κατηγορία! (Greek: “category”)
• How do we relate data with concepts in order to make inferences or take action?
• We use schemata: rules that constrain data to a language of “categories” (concepts)
• Some fundamental categories are built in
  • E.g. plurality, necessity, limitation, negation, reciprocity, etc.
• Others are built upon the foundation
Kant’s “schemata”
This schematism […] is an art, hidden in the depths of the human
soul, whose true modes of action we shall only with difficulty
discover and unveil. (Kant, 1781)
• Psychologists saw Kant’s “schemata” in the organization of human memory (Head, 1920), but
  • Memory is more than storage and recall
  • We react to data by combining a schema with an attitude (Bartlett, 1932)
Schemas in psychology
• Scripts, plans, goals (Schank & Abelson, 1970s)
• Frames (Minsky, 1974)
• Early KR languages
• Upper-level ontologies and commonsense knowledge bases
Enter databases and AI
Semantic Web to Property Graphs
• A vocabulary for vocabulary sharing
• Includes a handful of basic terms
  • Classes, properties, inheritance
• Meets the needs of most Web schemas
RDF Schema (RDFS)
• A much more expressive language for ontology development
• Supports:
  • Classes and properties with inheritance
  • Equality (sameAs, differentFrom, equivalentTo)
  • Property domain/range restrictions, cardinality restrictions
  • Inverse, transitive, and symmetric properties
  • Ontology metadata (imports, versioning)
• Sublanguages OWL Full, OWL DL (description logic), OWL Lite
• OWL 2 profiles EL (polynomial-time reasoning) and QL (memory-efficient query answering), plus OWL 2 DL (completeness and decidability)
• Is this slide too dense? OWL is huge.
OWL
• What commercial applications ended up using
  • Supported by AllegroGraph, TopBraid, etc.
• All of RDFS
  • Classes, properties, inheritance
• A few terms stolen from OWL
  • E.g. sameAs, inverseOf, TransitiveProperty
“RDFS+”
The Web of Data…
• Property Graph data model takes a minimalist approach
• Typically no inference or rules support
• Graph DBs, schema.org are a response to real-world demands
…simplified
[Figure: a small property graph with vertices 1, 2, and 3 and edges labeled foo, foo, and bar]
Re-emergence of the graph schema
• There is power in simplicity
  • NoSQL databases are said to have no predefined schema
• In practice, every graph DB has a schema
  • A set of constraints or assumptions about correct structure
  • Useful for validation and optimization
• There is no graph schema standard
NoSQL ⇏ no schema
• Property Graph data model is a basic schema
  • Edge labels (required)
  • Vertex labels (optional)
  • Property keys (required)
  • Property data types (optional, with optional constraints)
  • Vertex meta-properties (optional)
Schemas in TinkerPop
• Labels
  • Simple types on nodes and/or relationships
• Indexes
  • Single-property: equality, existence, containment, ranges
  • Composite (multiple properties): equality only
• Constraints
  • Node property uniqueness
  • Node/relationship property existence
  • Node key (set of properties unique for the node)
Schemas in Neo4j
• Vertex and edge labels
• Property keys
  • Property cardinality (SINGLE, LIST, SET)
• Indexes
  • Graph-centric
    • Individual properties, composite
  • Vertex-centric (index on incoming/outgoing edges)
    • Sorting key, sort order
• Automatic/implicit schema creation
Schemas in JanusGraph
• Object databases ≠ graph databases, but similar
• Built-in, object-oriented schemas
  • Classes, extension, relationships, recursivity, etc.
  • Used for encapsulation, composition, inheritance, delegation, etc.
• OOP frameworks for graph DBs
  • Frames, Ferma, etc.
Schemas in object databases
• Hypernode
  • Objects, relations, and functions
• GROOVY
  • Multi-level OOP schemas
• HyperGraphDB
  • Types and relationships
• Grakn.AI
  • Entities, relations, roles, and resources (data type, uniqueness, regex)
  • Single inheritance
Schemas in hypergraph databases
Elements of a schema language
• Support for a basic schema vocabulary
  • Entity and relationship types, constraints
• Good coverage of existing schema frameworks
• Extensibility of schemas and types
• Mappings to RDF, schema.org, and storage frameworks
• Reference APIs for
  • Schema validation
  • Graph schema initialization and migration
  • Statistical models, graph generation
Design goals
• Things about which we can make assertions
  • “Classes” in RDF, “types” in schema.org, “vertex labels” in TinkerPop, etc.
• Extend other entity types
Entity types
entities:
  - label: Trip
    sameAs: http://schema.org/TravelAction
    description: A trip taken by a driver or requested by a rider
• Assertions about things
• “Properties” in RDF and schema.org
• “Edges” vs. “properties” in graph databases
• Hyperedges, meta-properties are also “relations”
Relationship types
relations:
  - label: requested
    description: Relates a rider to a trip he or she has requested
    extends:
      - core.relatedTo
    cardinality: OneToMany
    from: users.User
    to: Trip
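The from and to fields of a relation type act as constraints on which entities an edge may connect. A minimal sketch of how an edge might be checked against them follows. The RelationType and RelationValidator classes are hypothetical names introduced for illustration, not part of the talk, and the sketch ignores extends and cardinality.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory form of one entry under "relations:" in the schema. */
final class RelationType {
    final String label;
    final String from; // required label of the source entity
    final String to;   // required label of the target entity

    RelationType(String label, String from, String to) {
        this.label = label;
        this.from = from;
        this.to = to;
    }
}

/** Checks edges against the declared "from" and "to" entity types (assumed exact match, no inheritance). */
final class RelationValidator {
    private final Map<String, RelationType> relations = new HashMap<>();

    void register(RelationType type) {
        relations.put(type.label, type);
    }

    /** Returns an empty Optional when the edge is valid, otherwise a description of the violation. */
    Optional<String> validate(String edgeLabel, String sourceLabel, String targetLabel) {
        RelationType type = relations.get(edgeLabel);
        if (type == null) {
            return Optional.of("unknown relation: " + edgeLabel);
        }
        if (!type.from.equals(sourceLabel)) {
            return Optional.of(edgeLabel + " must start at " + type.from + ", not " + sourceLabel);
        }
        if (!type.to.equals(targetLabel)) {
            return Optional.of(edgeLabel + " must end at " + type.to + ", not " + targetLabel);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        RelationValidator validator = new RelationValidator();
        validator.register(new RelationType("requested", "users.User", "Trip"));
        // A rider requesting a trip satisfies the declared types.
        System.out.println(validator.validate("requested", "users.User", "Trip").isPresent());
        // An edge that starts at a Trip violates the "from" constraint.
        System.out.println(validator.validate("requested", "Trip", "Trip").orElse("ok"));
    }
}
```

A fuller validator would presumably also walk the extends chain and count edges per vertex to enforce OneToMany.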
• Graph-centric
  • Single-relation, composite
• Entity-centric
  • Ordering on a secondary key
Index hints
indexes:
  - key: core.uuid
  - key: trips.requested
    direction: Out
    orderBy: core.createdAt
    order: Decreasing
• Schemas import other schemas, like software modules
• Give developers/teams autonomy, but
• Coordinate schema integration top-down
Schema imports
name: production
version: 1.2
includes:
  - name: trips
    version: 1.2
  - name: referrals
    version: 1.2
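Treating includes like module dependencies suggests a resolution step before a schema is accepted. The sketch below checks that every included schema exists at the requested version. IncludeResolver and its methods are hypothetical, and it assumes only one version of each schema is available at a time, which the talk does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical check that every "includes:" entry resolves to a known schema version. */
final class IncludeResolver {
    // name -> version of each available schema (assumption: one version per name)
    private final Map<String, String> available = new LinkedHashMap<>();

    void publish(String name, String version) {
        available.put(name, version);
    }

    /** Returns one message per include that is missing or available only at a different version. */
    List<String> unresolved(Map<String, String> includes) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> include : includes.entrySet()) {
            String found = available.get(include.getKey());
            if (found == null) {
                problems.add("missing schema: " + include.getKey());
            } else if (!found.equals(include.getValue())) {
                problems.add(include.getKey() + ": wanted " + include.getValue() + ", found " + found);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        IncludeResolver resolver = new IncludeResolver();
        resolver.publish("trips", "1.2");
        resolver.publish("referrals", "1.1");

        // The includes of the "production" schema from the slide.
        Map<String, String> includes = new LinkedHashMap<>();
        includes.put("trips", "1.2");
        includes.put("referrals", "1.2");
        System.out.println(resolver.unresolved(includes));
    }
}
```

Versions are compared as plain strings here; a real tool might prefer a compatibility rule over exact equality.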
Graph and schema management
• Study the source data
• Extend and validate the shared schema
• Generate artificial graph data
• Study system performance, iterate on the model
• Develop ingestion mappings for real data
• Review and check in schema changes
• Apply the schema to a live database
• Ingest data into the live database
Graph onboarding workflow
Revision control for schemas
• The schema is constantly changing
  • Is this database compatible with this schema?
  • How to update the database w.r.t. the schema?
• Use revision control to find diffs
  • Ordered lists of basic changes
• Translate diffs to storage-specific workflows
  • Ordered lists of idempotent operations
• Apply diff workflows to the database
Schema initialization, migration
public enum SchemaChange {
    AbstractAttributeChanged,
    CardinalityChanged,
    DomainChanged,
    EntityAdded,
    EntityRemoved,
    ExtensionAdded,
    ExtensionRemoved,
    IncludeAdded,
    IncludeRemoved,
    IndexAdded,
    IndexRemoved,
    RangeChanged,
    RelationAdded,
    RelationRemoved,
    RequiredAttributeChanged,
    RequiredOfAttributeChanged,
    SchemaAdded,
    SchemaRemoved,
    SchemaNameChanged,
    SchemaVersionChanged,
}
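To illustrate the idea of an ordered list of basic changes that translates into idempotent operations, here is a minimal sketch restricted to entity labels. EntityDiff, Kind, and Change are hypothetical names; only EntityAdded and EntityRemoved are borrowed from the enum above, and a real diff would presumably cover the other change kinds as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical diff over entity labels only. */
final class EntityDiff {
    /** Subset of the change kinds listed in SchemaChange. */
    enum Kind { EntityAdded, EntityRemoved }

    /** One basic change: a kind plus the entity label it applies to. */
    static final class Change {
        final Kind kind;
        final String label;

        Change(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }

        @Override
        public String toString() {
            return kind + "(" + label + ")";
        }
    }

    /** Returns removals first, then additions, each group sorted by label so the order is stable. */
    static List<Change> diff(Set<String> oldEntities, Set<String> newEntities) {
        List<Change> changes = new ArrayList<>();
        for (String label : new TreeSet<>(oldEntities)) {
            if (!newEntities.contains(label)) {
                changes.add(new Change(Kind.EntityRemoved, label));
            }
        }
        for (String label : new TreeSet<>(newEntities)) {
            if (!oldEntities.contains(label)) {
                changes.add(new Change(Kind.EntityAdded, label));
            }
        }
        return changes;
    }

    /** Applies changes to a set of labels; Set.add and Set.remove make re-application a no-op. */
    static void apply(List<Change> changes, Set<String> entities) {
        for (Change change : changes) {
            if (change.kind == Kind.EntityAdded) {
                entities.add(change.label);
            } else {
                entities.remove(change.label);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> v1 = new TreeSet<>();
        v1.add("Trip");
        v1.add("Referral");
        Set<String> v2 = new TreeSet<>();
        v2.add("Trip");
        v2.add("Driver");

        List<Change> changes = diff(v1, v2);
        System.out.println(changes); // prints "[EntityRemoved(Referral), EntityAdded(Driver)]"

        Set<String> database = new TreeSet<>(v1);
        apply(changes, database);
        apply(changes, database); // applying twice leaves the same result
        System.out.println(database.equals(v2)); // prints "true"
    }
}
```

Idempotence matters because a migration that fails partway can then simply be rerun from the start.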
Schema diff and patch
[Diagram: a new database is initialized to Schema x.1; the diff of Schema x.1 and x.2 is found and then applied, taking the database from Schema x.1 to Schema x.2]
Migration is not always possible
Don’t feel bad!
Basic schemas can’t be changed!
• E.g.
  • Removal or abstraction of types already in use
  • Changes unsupported at the storage level
Graph generation
• Problem:
  • Need to predict write throughput, read latency given 10x more data
  • Analytical solutions are difficult
• Solution?
  • Generate graphs of different sizes
  • Study the trends
• Problem:
  • Where do we get the data?
  • Shrinking or growing real data is difficult
Capacity planning
• Existing graph benchmarks
  • Lancichinetti-Fortunato-Radicchi (LFR) benchmark
  • graphdb-benchmarks
  • Linked Data Benchmark Council (LDBC)
  • SPARQL benchmarks for triple stores
• None of these are very much like our data
  • Not a social network; no power law distributions
  • Vastly different topology
• Idea: use the schema to generate statistically representative data
Benchmarking options
• Gather some statistics
  • Entity and relationship type distributions
  • Per-relationship in- and out-degree distributions
• Add these to the schema
• Give the Graphgen utility a dataset size, random seed
  • Graphgen attempts to create a graph in accordance with the model
• Gather statistics from the generated graph
  • Compare and contrast
• Same dataset can be generated in different environments
Graph generation workflow
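The role of the dataset size and random seed can be sketched with a small sampler that draws vertex labels from an entity type distribution. This is not the Graphgen utility, whose interface the slides do not show. LabelSampler, the example weights, and the choice of java.util.Random are all assumptions made for illustration, and the sketch assumes a non-empty weight map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sampler: draws vertex labels from an entity type distribution. */
final class LabelSampler {
    /**
     * Draws `size` labels with probability proportional to each weight.
     * The same weights (in the same iteration order), size, and seed give the same list.
     */
    static List<String> sample(Map<String, Double> weights, int size, long seed) {
        double total = 0.0;
        for (double w : weights.values()) {
            total += w;
        }
        Random random = new Random(seed);
        List<String> labels = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            // Pick a point in [0, total) and walk the weights until it is used up.
            double target = random.nextDouble() * total;
            String chosen = null;
            for (Map.Entry<String, Double> entry : weights.entrySet()) {
                chosen = entry.getKey();
                target -= entry.getValue();
                if (target < 0) {
                    break;
                }
            }
            labels.add(chosen);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up distribution; insertion order keeps the sampling stable across runs.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("User", 0.2);
        weights.put("Trip", 0.8);

        List<String> first = sample(weights, 10, 42L);
        List<String> second = sample(weights, 10, 42L);
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

Because a seeded generator yields the same sequence on every run, the same synthetic dataset can be rebuilt in another environment from just the schema, the size, and the seed. Comparing the label frequencies in the output against the input weights is one form of the "compare and contrast" step.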
Q&A
Joshua Shinavier
joshsh@uber.com
Kyler Liu
kylerliu@uber.com
Vignesh Ganapathy
vigneshg@uber.com