M. David Allen
dmallen@mitre.org
Family Tree of Data –
Provenance and Neo4J
Graph Database Meet up, Arlington
February 10th, 2015
Approved for Public Release; Distribution Unlimited. Case Number 15-0190
The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to
convey or imply MITRE's concurrence with, or support for, the positions, opinions or viewpoints expressed by the author
What?! Where did
that come from?
Says who?
Information is gathered and
fused from many different
sources. Often, it’s lacking
context. How do you know
what to trust?
Intelligence Corroboration
• Differentiate independent corroborating reports from multiple reports derived from a single source
– Critical in helping to differentiate between reports of 5 wounded and 5 reports of 1 wounded
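One way to picture the corroboration check is to walk each report upstream to its original sources and count how many distinct sources are left. The sketch below is a minimal illustration in plain Python; the `DERIVED_FROM` map, the report names, and both helper functions are hypothetical and are not part of PLUS.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical lineage: each item maps to the items it was derived from.
DERIVED_FROM: Dict[str, List[str]] = {
    "report_a": ["field_obs_1"],
    "report_b": ["report_a"],     # a rewrite of report_a
    "report_c": ["report_a"],     # another rewrite of report_a
    "report_d": ["field_obs_2"],  # an independent observation
}


def root_sources(item: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Return the items with no upstream lineage that feed into `item`."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = derived_from.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)  # nothing upstream: this is an original source
    return roots


def independent_source_count(reports: Iterable[str],
                             derived_from: Dict[str, List[str]]) -> int:
    """Count the distinct original sources behind a set of reports."""
    roots: Set[str] = set()
    for report in reports:
        roots |= root_sources(report, derived_from)
    return len(roots)
```

Under these assumptions, `report_a`, `report_b` and `report_c` together count as one independent source, while `report_a` and `report_d` count as two.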
Trust Assessment and Warning
Intrusion detection & security
improved assessment and warning
Objective: Trust Assessment and Warning
• Producers and consumers of information are often decoupled (good)
• The right hand has no way of finding out what the left hand is doing (bad)
• Provenance permits taint and marking propagation
– Add markings to reflect domain-dependent concerns
• Every information resource becomes its own feed: producers can keep consumers up to date, without coupling
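Taint and marking propagation can be sketched as a downstream walk over the provenance graph: mark one item, then push the marking to everything derived from it. This is a minimal sketch only; the `FEEDS` map, the item names, and `propagate_marking` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical lineage: each item maps to the items derived from it.
FEEDS: Dict[str, List[str]] = {
    "sensor_feed": ["fusion_job"],
    "fusion_job": ["track_db"],
    "track_db": ["ops_display", "daily_summary"],
    "weather_feed": ["daily_summary"],
}


def propagate_marking(source: str,
                      marking: str,
                      feeds: Dict[str, List[str]],
                      markings: Optional[Dict[str, Set[str]]] = None
                      ) -> Dict[str, Set[str]]:
    """Attach `marking` to `source` and to everything downstream of it."""
    markings = {} if markings is None else markings
    visited: Set[str] = set()
    queue = deque([source])
    while queue:
        item = queue.popleft()
        if item in visited:
            continue
        visited.add(item)
        markings.setdefault(item, set()).add(marking)
        queue.extend(feeds.get(item, []))  # continue to derived items
    return markings
```

Marking `sensor_feed` as suspect reaches `daily_summary` through `track_db`, while `weather_feed` stays unmarked because nothing suspect flows into it.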
Mission Impact Assessment
What if something breaks? Who
is affected?
Dependencies among information assets improve mission
impact assessment and contingency planning.
Provenance Graphs
• Strong temporal element (left -> right layout)
• Rectangles denote processes or algorithms
• Circles denote data (inputs and outputs to rectangles)
• Graphs can be nested; each item may be further detailed by a separate graph
Storing Provenance in a
Relational Database
• Previously: MySQL and PostgreSQL
• Storage structure was (simplified):
  – “Provenance Object” table
  – “Provenance Edge” table
    • “from” property (foreign key to objects)
    • “to” property (foreign key to objects)
• Discovering graph structure was a process of repeatedly joining the object table to the edge table
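The simplified two-table layout might look like the sketch below, using SQLite from the Python standard library in place of MySQL or PostgreSQL. The table names, column names, and sample rows are assumptions for illustration, not the actual PLUS schema.

```python
import sqlite3
from typing import List


def build_store() -> sqlite3.Connection:
    """Create an in-memory object/edge store with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE provenance_object (
            oid  TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT NOT NULL              -- 'data' or 'invocation'
        );
        CREATE TABLE provenance_edge (
            from_oid TEXT NOT NULL REFERENCES provenance_object(oid),
            to_oid   TEXT NOT NULL REFERENCES provenance_object(oid)
        );
    """)
    conn.executemany(
        "INSERT INTO provenance_object VALUES (?, ?, ?)",
        [("o1", "raw feed", "data"),
         ("o2", "parse", "invocation"),
         ("o3", "enrich", "invocation"),
         ("o4", "IRIS", "data")],
    )
    conn.executemany(
        "INSERT INTO provenance_edge VALUES (?, ?)",
        [("o1", "o2"), ("o1", "o3"), ("o2", "o4"), ("o3", "o4")],
    )
    return conn


def immediately_upstream(conn: sqlite3.Connection, name: str) -> List[str]:
    """One hop upstream: join object -> edge -> object."""
    rows = conn.execute("""
        SELECT parent.name
        FROM provenance_object AS child
        JOIN provenance_edge   AS e      ON e.to_oid = child.oid
        JOIN provenance_object AS parent ON parent.oid = e.from_oid
        WHERE child.name = ?
        ORDER BY parent.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]
```

Each additional hop needs another pair of joins, which is where the repeated joining comes from.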
RDBMS Provenance Storage
• Pros:
  – Fast individual object lookup by property
  – Good performance on one-hop edges (“What is immediately upstream from IRIS?”)
• Cons:
  – Writing SQL for graph operations was very complex, and the resulting queries performed very poorly (“Give me all nodes between 2 and 5 hops from IRIS”)
  – SQL has no first-class graph or path operations, so you have two choices:
    • Wizard-level SQL
    • Push graph operations (BFS, DFS, shortest path) into Java code
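Pushing a graph operation into application code looks roughly like the breadth-first search below. It is written in Python for brevity rather than Java, and the `UPSTREAM` map and function name are hypothetical. Note one simplification: it filters on shortest distance, which is not quite the same as matching every path of length 2 to 5.

```python
from collections import deque
from typing import Dict, List

# Hypothetical upstream adjacency: each node maps to the nodes it came from.
UPSTREAM: Dict[str, List[str]] = {
    "IRIS": ["a"],
    "a": ["b"],
    "b": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}


def nodes_within_hops(adjacency: Dict[str, List[str]],
                      start: str,
                      min_hops: int,
                      max_hops: int) -> Dict[str, int]:
    """BFS from `start`; keep nodes whose shortest distance is in range."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {n: d for n, d in distance.items() if min_hops <= d <= max_hops}
```

Every new question of this kind needs another hand-written traversal, which is the maintenance cost the slide is pointing at.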
Wizard-Level SQL for Graphs: Painful and Error Prone
• Write me a SQL query that finds all paths from node 1 to node 6, and orders the results by path length.
• (Requires an extra step defining views, sometimes separate T-SQL methods)
• Requires re-implementing basic graph concepts (paths) in SQL
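For a sense of what the challenge involves, here is one possible answer using a recursive common table expression in SQLite, driven from Python. This swaps in a different technique from the views and T-SQL methods mentioned on the slide, and the `edge` table and sample edges are hypothetical. The path has to be built up as a delimited string, and cycle detection is done by substring search.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical directed edges for illustration.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 6), (2, 6), (3, 5)]


def all_paths(edges: List[Tuple[int, int]],
              start: int,
              goal: int) -> List[Tuple[List[int], int]]:
    """Return every cycle-free path from start to goal, shortest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (from_id INTEGER, to_id INTEGER)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute("""
        WITH RECURSIVE walk(node, path, hops) AS (
            SELECT ?, '/' || ? || '/', 0
            UNION ALL
            SELECT e.to_id, walk.path || e.to_id || '/', walk.hops + 1
            FROM walk
            JOIN edge AS e ON e.from_id = walk.node
            -- a path is a string like '/1/2/6/'; skip nodes already on it
            WHERE instr(walk.path, '/' || e.to_id || '/') = 0
        )
        SELECT path, hops FROM walk
        WHERE node = ?
        ORDER BY hops, path
    """, (start, start, goal)).fetchall()
    conn.close()
    return [([int(p) for p in path.strip("/").split("/")], hops)
            for path, hops in rows]
```

With the sample edges, `all_paths(EDGES, 1, 6)` yields the two-hop path 1, 2, 6 first, followed by the two three-hop paths through node 4.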
Fitness Widgets
• Probably one of the “killer apps” for provenance
• Simple provenance analysis functions, integrated into other applications
• Most are simple queries over a provenance graph
• With RDBMS storage, every fitness widget requires a new Java module
– …because using an RDBMS has the consequence of pushing graph operations into code
– Bad, bad, bad
An Idea
What if we used a graph database to
store and query our graphs?
(What a concept!)
Top on the Shopping List
• Declarative graph query language, NOT imperative graph traversal
– “Here’s what I want. Database, you figure out how to make it happen”
• Flexible schema
– Each new provenance environment has a different collection of important metadata, and has to be tailored each time
• Has to play nice with Java because enterprises trust Java
– Stodgy IT environments don’t care how cool your OCaml implementation is, they won’t run it.
PLUS (MITRE’s Research Software)
https://github.com/plus-provenance/plus
Clone it, run “mvn jetty:run”, then open http://localhost:8080/plus/
Uses of PLUS
• Provider of “information fitness widgets”: canned analytical queries over provenance graphs
• Basic database for capture, query, and reporting over provenance
• API for building provenance capture agents in new environments
• Sandbox for advanced applications of provenance
• Focused more on capture in distributed systems environments than on “hand curated” provenance data sets
Example Queries (for a taste)
Get all upstream invocations between 1 and 5 steps away:
MATCH (n {oid: $oid})<-[*1..5]-(m)
WHERE m.type = 'invocation'
RETURN m;
Does this item flow into any Common Operating Picture (COP)?
MATCH (n {oid: $oid})-[*]->(m)
WHERE m.name =~ '.*COP.*'
RETURN m;
Example Queries (for a taste)
How many hops away is GCCS-I3?
MATCH p = (m)-[*]->(n {oid: $oid})
WHERE m.name = 'GCCS-I3'
RETURN length(p);
How many different data items from TBMCS contribute to this node?
MATCH (n {oid: $oid})<-[*]-(m),
(owner)-[:owns]->(m)
WHERE m.type = 'data' AND owner.name = 'TBMCS'
RETURN count(DISTINCT m);
What do we give up with Cypher & Neo4J?
• Graph databases are said to be “naturally indexed” by relationships
• Nodes can be indexed by label and by properties, but they will never be as performant as RDBMSs for certain kinds of bulk queries
How to make Neo4j performance look terrible (in comparison): RSS Feeds
What are the latest reported provenance items?
MATCH (n)
WHERE n.type = 'data' AND
n.created > (today at midnight) AND
n.created < (now)
RETURN n
“Graph Fishing Expeditions”: when you’re not starting from anywhere.
“Bulk Scans”: generic queries that don’t apply to any particular label or index subset.
“The Table Anti-Pattern”: link everything by ID instead of by relationship.
How to Think in Graphs:
Get all relationships in a workflow
Node/Table Orientation: SLOWER (and requires a special index on the “id” property)
[Figure: nodes A, B, C and Workflow id=1; each relationship carries a wf=1 property]
MATCH ()-[r]->()
WHERE r.workflow = 'SOME WORKFLOW ID'
RETURN r
Graph Orientation: FASTER
[Figure: a workflow node linked to nodes A, B, C by “instance” relationships]
MATCH (wf:Workflow {oid: 'foo'})-[r:instance]->(node),
(node)-[pr:generated|`input to`]->(m)
RETURN pr
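The difference between the two orientations can be sketched in plain Python by counting how many relationships each approach has to examine. Everything here (the `Rel` tuple, the sample data, the function names) is hypothetical and only mimics the shape of the two queries. It says nothing about how Neo4j executes them internally.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Rel(NamedTuple):
    source: str
    kind: str
    target: str
    workflow: str  # table-style ID copied onto every relationship


LINEAGE = {"generated", "input to"}

RELS = [
    Rel("wf1", "instance", "A", "wf1"),
    Rel("wf1", "instance", "B", "wf1"),
    Rel("wf1", "instance", "C", "wf1"),
    Rel("A", "generated", "B", "wf1"),
    Rel("B", "input to", "C", "wf1"),
    Rel("wf2", "instance", "X", "wf2"),
    Rel("wf2", "instance", "Y", "wf2"),
    Rel("X", "generated", "Y", "wf2"),
]


def build_outgoing(rels: List[Rel]) -> Dict[str, List[Rel]]:
    """Index relationships by source node, like following graph pointers."""
    outgoing: Dict[str, List[Rel]] = defaultdict(list)
    for rel in rels:
        outgoing[rel.source].append(rel)
    return outgoing


def table_orientation(rels: List[Rel], workflow_id: str) -> Tuple[List[Rel], int]:
    """Scan every relationship and filter on the workflow ID property."""
    examined, found = 0, []
    for rel in rels:
        examined += 1
        if rel.workflow == workflow_id and rel.kind in LINEAGE:
            found.append(rel)
    return found, examined


def graph_orientation(outgoing: Dict[str, List[Rel]],
                      workflow_id: str) -> Tuple[List[Rel], int]:
    """Start at the workflow node and follow relationships outward."""
    examined, found = 0, []
    for inst in outgoing.get(workflow_id, []):
        examined += 1
        if inst.kind != "instance":
            continue
        for rel in outgoing.get(inst.target, []):
            examined += 1
            if rel.kind in LINEAGE:
                found.append(rel)
    return found, examined
```

Both functions return the same two lineage relationships for `wf1`, but the scan touches all 8 relationships while the traversal touches 5. The scan's cost grows with the whole graph, whereas the traversal's cost grows only with the size of the one workflow.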
Some other Observations on Graph
Databases: Good and Bad
• Partitioning and sharding graphs is very difficult, and still an area of active research
– Other graph systems out there (e.g. Apache Giraph, a graph processing framework) somewhat hide this problem, or make other compromises to get around it (“Bulk Synchronous Parallel” algorithms)
• Neo4J presently scales to billions of nodes per machine, and can traverse thousands of relationships very quickly
• Much more natural mapping from OO class hierarchies into graph databases than to RDBMS (object/table impedance mismatch)
• Graph performance tuning is a new thing to most operations people and sysadmins
– Everybody knows how to make Oracle fly; graph skills are much less common
– Can make for perception problems, compounded by poor graph design (e.g. “table orientation”, designing a graph like it’s a table)
Contact Information
M. David Allen
The MITRE Corporation
Office: (804) 288-0355
dmallen@mitre.org
Backups and Additional Materials
PLUS Provenance Service
[Architecture diagram. Components: Provenance Manager, Provenance Store. Actors: Users & Applications, Administrators, Applications & Capture Agents. Operations: Report, Annotate, Retrieve, Administer (access control, archiving, etc.). Capture paths: API (provenance-aware applications) and coordination points for automatic provenance capture.]
Architectural Options for Provenance Capture
• “Smart Applications”
– Strategy: Each application calls the lineage API to log whatever it thinks is important
– But, unrealistic for legacy applications
• “Interceptors”
– Strategy: Listen in to whatever is happening, and log silently as it happens
– Requires a small number of points of lineage capture: ESBs are ideal, since they act as central “routers”
• “Wrappers”
– Strategy: Write a transparent wrapper service. Make sure all orchestrations call the wrapper service with enough information for the wrapper to invoke the real thing
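The “wrapper” strategy can be pictured as a decorator that records inputs, the invocation, and the output, then calls the real thing. This is a toy sketch in Python and not the PLUS capture API; `ProvenanceLog`, `with_provenance`, and `fuse` are all hypothetical names.

```python
import functools
import itertools
from typing import Callable, List, Tuple


class ProvenanceLog:
    """Hypothetical in-memory stand-in for a provenance store."""

    def __init__(self) -> None:
        self.nodes: List[Tuple[str, str, str]] = []  # (oid, type, name)
        self.edges: List[Tuple[str, str]] = []       # (from_oid, to_oid)
        self._ids = itertools.count(1)

    def add_node(self, node_type: str, name: str) -> str:
        oid = f"n{next(self._ids)}"
        self.nodes.append((oid, node_type, name))
        return oid

    def add_edge(self, from_oid: str, to_oid: str) -> None:
        self.edges.append((from_oid, to_oid))


def with_provenance(log: ProvenanceLog) -> Callable:
    """Wrap a function so each call is logged as data -> invocation -> data."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args):
            invocation = log.add_node("invocation", func.__name__)
            for arg in args:
                data = log.add_node("data", repr(arg))
                log.add_edge(data, invocation)   # input to the invocation
            result = func(*args)                 # invoke the real thing
            output = log.add_node("data", repr(result))
            log.add_edge(invocation, output)     # generated by the invocation
            return result
        return wrapper
    return decorator


LOG = ProvenanceLog()


@with_provenance(LOG)
def fuse(report_a: str, report_b: str) -> str:
    """Hypothetical processing step whose lineage gets captured."""
    return f"{report_a}+{report_b}"
```

Calling `fuse("r1", "r2")` returns the normal result and leaves one invocation node, three data nodes, and three edges in `LOG`, without `fuse` itself knowing anything about provenance.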

 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Family tree of data – provenance and neo4j

  • 1. M. David Allen dmallen@mitre.org Family Tree of Data – Provenance and Neo4J Graph Database Meet up, Arlington February 10th, 2015 Approved for Public Release; Distribution Unlimited. Case Number 15-0190 The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions or viewpoints expressed by the author
  • 2. What?! Where did that come from? Says who? Information is gathered and fused from many different sources. Often, it’s lacking context. How do you know what to trust?
  • 3. Intelligence Corroboration • Differentiate independent corroborating reports from multiple reports derived from a single source – Critical in helping to differentiate between reports of 5 wounded and 5 reports of 1 wounded
  • 4. Trust Assessment and Warning Intrusion detection & security improved assessment and warning
  • 5. Objective: Trust Assessment and Warning • Producers and consumers of information are often decoupled (good) • The right hand has no way of finding out what the left hand is doing (bad) • Provenance permits taint and marking propagation – Add markings to reflect domain-dependent concerns • Every information resource becomes its own feed: producers can keep consumers up to date, without coupling
  • 6. Mission Impact Assessment What if something breaks? Who is affected? Dependencies among information assets improve mission impact assessment and contingency planning.
  • 7. Provenance Graphs • Strong temporal element (left -> right layout) • Rectangles denote processes or algorithms • Circles denote data (inputs and outputs to rectangles) • Graphs can be nested; each item may be further detailed by a separate graph
  • 8. Storing Provenance in a Relational Database • Previously: MySQL and PostgreSQL • Storage structure was (simplified): • “Provenance Object” table • “Provenance Edge” table • “from” property (foreign key to objects) • “to” property (foreign key to objects) • Discovering graph structure was a process of repeatedly joining the object table to the edge table
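A minimal sketch of the simplified two-table layout from slide 8, using Python's built-in `sqlite3` in place of MySQL or PostgreSQL. The table and column names (`provenance_object`, `provenance_edge`, `from_oid`, `to_oid`) and the sample rows are assumptions for illustration; the slides do not give the real schema. It is meant to show how each extra hop upstream costs one more join against the edge table.

```python
import sqlite3

# Hypothetical schema: one table for objects, one for directed edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE provenance_object (oid TEXT PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE provenance_edge (
    from_oid TEXT REFERENCES provenance_object(oid),
    to_oid   TEXT REFERENCES provenance_object(oid)
);
""")
conn.executemany(
    "INSERT INTO provenance_object VALUES (?, ?, ?)",
    [("o1", "raw report", "data"),
     ("o2", "fusion", "invocation"),
     ("o3", "IRIS", "data")],
)
# raw report -> fusion -> IRIS
conn.executemany(
    "INSERT INTO provenance_edge VALUES (?, ?)",
    [("o1", "o2"), ("o2", "o3")],
)

def one_hop_upstream(conn, name):
    """Names of objects immediately upstream of `name`: one edge join."""
    rows = conn.execute("""
        SELECT src.name
        FROM provenance_object AS tgt
        JOIN provenance_edge   AS e   ON e.to_oid = tgt.oid
        JOIN provenance_object AS src ON src.oid = e.from_oid
        WHERE tgt.name = ?
        ORDER BY src.name""", (name,))
    return [r[0] for r in rows]

def two_hops_upstream(conn, name):
    """Names of objects exactly two hops upstream: a second edge join."""
    rows = conn.execute("""
        SELECT src.name
        FROM provenance_object AS tgt
        JOIN provenance_edge   AS e1  ON e1.to_oid = tgt.oid
        JOIN provenance_edge   AS e2  ON e2.to_oid = e1.from_oid
        JOIN provenance_object AS src ON src.oid = e2.from_oid
        WHERE tgt.name = ?
        ORDER BY src.name""", (name,))
    return [r[0] for r in rows]
```

With a fixed number of hops the join count is known in advance; a query over a variable number of hops has no fixed join count, which is where the difficulty described on the following slides comes from.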
  • 9. RDBMS Provenance Storage • Pros: • Fast individual object lookup by property • Good performance on one-hop edges (what is immediately upstream from “IRIS”?) • Cons: • Writing SQL for graph operations was very complex, and the resulting queries performed very poorly (“Give me all nodes between 2 and 5 hops from IRIS”) • SQL doesn’t support graphs, so you have two choices: • Wizard-Level SQL • Push graph operations (BFS, DFS, shortest path) into Java code
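The second choice on slide 9, pushing graph operations into application code, might look roughly like the breadth-first search below. It is sketched in Python rather than the Java the slides mention, and the adjacency map and node names other than "IRIS" are made up for illustration. Note one assumption: it counts each node by its shortest hop distance, which is only one possible reading of "between 2 and 5 hops".

```python
from collections import deque

def nodes_within_hops(upstream, start, min_hops, max_hops):
    """Breadth-first search over an upstream adjacency map.

    Returns {node: hops} for every node whose shortest upstream distance
    from `start` lies in [min_hops, max_hops].
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for parent in upstream.get(node, ()):
            if parent not in dist:  # first visit is the shortest distance
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return {n: d for n, d in dist.items() if min_hops <= d <= max_hops}

# Hypothetical chain of upstream dependencies: IRIS <- a <- b <- ... <- f
SAMPLE_UPSTREAM = {
    "IRIS": ["a"], "a": ["b"], "b": ["c"],
    "c": ["d"], "d": ["e"], "e": ["f"],
}
```

In this arrangement the database only serves edges; the traversal logic lives in the application, so each new kind of graph question needs new code.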
  • 10. Wizard-Level SQL for Graphs: Painful and Error Prone • Write me a SQL query that finds all paths from node 1 to node 6, and orders the results by path length. • (Requires extra step defining views, sometimes separate T-SQL methods) • Requires re-implementing basic graph concepts (paths) in SQL
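To give a feel for the challenge on slide 10, here is one possible answer using a recursive common table expression, run through Python's built-in `sqlite3`. This is not the approach the slides describe (they mention views and separate T-SQL methods); the `edge` table, its columns, and the sample edges are assumptions for illustration. The path has to be rebuilt by hand as a delimited string, which is the "re-implementing basic graph concepts" the slide complains about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (from_id INTEGER, to_id INTEGER)")
# Hypothetical graph with three distinct routes from node 1 to node 6.
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [(1, 2), (2, 6), (1, 3), (3, 4), (4, 6), (1, 6)],
)

ALL_PATHS = """
WITH RECURSIVE walk(node, path, hops) AS (
    -- Anchor row: the start node, with the path kept as a '/'-delimited string.
    SELECT :start, '/' || :start || '/', 0
    UNION ALL
    -- Recursive step: extend every partial path by one outgoing edge.
    SELECT e.to_id, walk.path || e.to_id || '/', walk.hops + 1
    FROM walk
    JOIN edge AS e ON e.from_id = walk.node
    -- Cycle guard: never revisit a node already on this path.
    WHERE instr(walk.path, '/' || e.to_id || '/') = 0
)
SELECT path, hops FROM walk WHERE node = :goal ORDER BY hops, path
"""

def all_paths(conn, start, goal):
    """All simple paths from start to goal, shortest first."""
    return conn.execute(ALL_PATHS, {"start": start, "goal": goal}).fetchall()
```

Even in this small form, the query carries its own path encoding and cycle check, both of which a graph query language provides directly.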
  • 11. Fitness Widgets • Probably one of the “killer apps” for provenance • Simple provenance analysis functions, integrated into other applications • Most are simple queries over a provenance graph • With RDBMS storage, every fitness widget requires a new Java module – …because using RDBMS has the consequence of pushing graph operations into code – Bad, bad, bad
  • 12. An Idea What if we used a graph database to store and query our graphs? (What a concept!)
  • 13. Top on the Shopping List • Declarative graph query language, NOT imperative graph traversal – “Here’s what I want. Database, you figure out how to make it happen” • Flexible schema – Each new provenance environment has a different collection of important metadata, and has to be tailored each time • Has to play nice with Java because enterprises trust Java – Stodgy IT environments don’t care how cool your OCaml implementation is, they won’t run it.
  • 14. PLUS (MITRE’s Research Software) https://github.com/plus-provenance/plus Clone it, “mvn jetty:run”, then hit http://localhost:8080/plus/
  • 15. Uses of PLUS • Provider of “information fitness widgets” – canned analytical queries over provenance graphs • Basic database for capture, query, reporting over provenance • API for building provenance capture agents in new environments • Sandbox for advanced applications of provenance • Focused more on capture in distributed systems environments than on “hand curated” provenance data sets
  • 16. Example Queries (for a taste) • Get all upstream invocations between 1 and 5 steps away: match (n {oid: {oid}})<-[*1..5]-m where m.type='invocation' return m; • Does this item flow into any Common Operating Picture (COP)? match (n {oid: {oid}})-[*]->m where m.name =~ '.*COP.*' return m;
  • 17. Example Queries (for a taste) • How many hops away is GCCS-I3? match m-[r*]->(n {oid: {oid}}) where m.name='GCCS-I3' return length(r); • How many different data items from TBMCS contribute to this node? START n=node:node_auto_index(oid={oid}) match (n {oid: {oid}})<-[r*]-m, owner-[r1:owns]->m where m.type='data' and owner.name='TBMCS' return count(distinct m);
  • 18. What do we give up with Cypher & Neo4J? • Graph databases are said to be “naturally indexed” by relationships • Nodes can be indexed by label and by properties, but a graph database will never be as performant as an RDBMS for certain kinds of bulk queries
  • 19. How to make neo4j performance look terrible (in comparison): RSS Feeds • What are the latest reported provenance items? MATCH n WHERE n.type? = 'data' AND n.created? > (today at midnight) AND n.created? < (now) RETURN n • “Graph Fishing Expeditions” – when you’re not starting from anywhere. • “Bulk Scans” – generic queries that don’t apply to any particular label or index subset. • “The Table Anti-Pattern” – link everything by ID instead of relationship
  • 20. How to Think in Graphs: Get all relationships in a workflow • Node/Table Orientation (diagram: Workflow id=1; each relationship among A, B, C carries wf=1) – SLOWER (and requires special index on “id” property): START r=relationship(*) WHERE r.workflow = 'SOME WORKFLOW ID' RETURN r • Graph Orientation (diagram: a workflow node linked to A, B, C by “instance” relationships) – FASTER: MATCH (wf:Workflow {oid: 'foo'})-[r:instance]->node, node-[pr:generated|`input to`]->m RETURN pr
  • 21. Some other Observations on Graph Databases: Good and Bad • Partitioning and sharding graphs is very difficult, and still the subject of some research – Other graph DBs out there (e.g. Apache Giraph) somewhat hide this problem, or make other compromises to get around it (“Bulk Synchronous Parallel” algorithms) • Neo4J presently scales to billions of nodes per machine, and can traverse thousands of relationships very quickly • Much more natural mapping from OO class hierarchies into graph databases than to RDBMS (object/table impedance mismatch) • Graph performance tuning is a new thing to most operations people and sysadmins – Everybody knows how to make Oracle fly; graph skills are much less common – Can make for perception problems, compounded by poor graph design (e.g. “table orientation”, designing a graph like it’s a table)
  • 22. Contact Information M. David Allen The MITRE Corporation Office: (804) 288-0355 dmallen@mitre.org
  • 24. PLUS Provenance Service [Architecture diagram. Components: Provenance Manager, Provenance Store. Actors: Users & Applications, Administrators, Applications & Capture Agents. Operations: Report, Annotate, Retrieve, Administer (access control, archiving, etc.). Interfaces: API (provenance-aware applications); coordination points for automatic provenance capture.]
  • 25. Architectural Options for Provenance Capture • “Smart Applications” – Strategy: Each application calls lineage API to log whatever it thinks is important – But, unrealistic for legacy applications • “Interceptors” – Strategy: Listen in to whatever is happening, and log silently as it happens – Requires a small number of points of lineage capture: ESBs are ideal, since they act as central “routers” • “Wrappers” – Strategy: Write a transparent wrapper service. Make sure all orchestrations call the wrapper service with enough information for the wrapper to invoke the real thing
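As an in-process analogy for the “Wrappers” strategy on slide 25, the sketch below uses a Python decorator: callers invoke the wrapper, which calls the real function and records what went in and what came out. The slides describe a wrapper service rather than a decorator, and everything named here (`PROVENANCE_LOG`, `with_provenance`, `fuse_reports`) is hypothetical; none of it is part of the PLUS API.

```python
import functools
import time

# Stand-in for a provenance store; a real deployment would report elsewhere.
PROVENANCE_LOG = []

def with_provenance(service):
    """Wrap `service` so that every call is recorded as an invocation."""
    @functools.wraps(service)
    def wrapper(*args, **kwargs):
        output = service(*args, **kwargs)  # invoke the real thing
        PROVENANCE_LOG.append({
            "invocation": service.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "logged_at": time.time(),
        })
        return output
    return wrapper

@with_provenance
def fuse_reports(reports):
    """Hypothetical processing step: de-duplicate and sort report ids."""
    return sorted(set(reports))
```

The wrapped function itself stays unaware of provenance, which is the appeal of this option for code that cannot be modified to call a lineage API directly.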