By András Fehér
THE REAL VALUE IS IN THE RELATIONSHIPS
• Google: Knowledge Graph
• Facebook: Unicorn
• Twitter: FlockDB
• ....
WHAT IS THE PROBLEM WITH RDBMS? (PART 1)
The base question of all recommendation systems:
“User 99 has bought products 1, 2, 3 and 765 so far. List the other products that other users bought together with products 1, 2, 3 or 765, in descending order of popularity.”
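In a relational schema this question typically becomes a multi-way self-join on the purchase table. The following is a minimal sketch using Python's built-in sqlite3 module. The table purchases(user_id, product_id), the function name also_bought and the sample rows are assumptions made for illustration; they are not from the slides. "Popularity" is taken here to mean the number of distinct other users who bought the product.

```python
import sqlite3

def also_bought(conn, user_id):
    """Products bought by other users who share a product with user_id, most popular first."""
    sql = """
        SELECT other.product_id, COUNT(DISTINCT other.user_id) AS popularity
        FROM purchases AS mine
        JOIN purchases AS shared ON shared.product_id = mine.product_id
                                AND shared.user_id <> mine.user_id
        JOIN purchases AS other  ON other.user_id = shared.user_id
        WHERE mine.user_id = ?
          AND other.product_id NOT IN (
              SELECT product_id FROM purchases WHERE user_id = ?)
        GROUP BY other.product_id
        ORDER BY popularity DESC, other.product_id
    """
    return conn.execute(sql, (user_id, user_id)).fetchall()

# Hypothetical sample data: (user_id, product_id)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, product_id INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (99, 1), (99, 2), (99, 3), (99, 765),
    (7, 1), (7, 10), (7, 11),
    (8, 2), (8, 10),
    (9, 500),
])
recommendations = also_bought(conn, 99)
print(recommendations)
```

Every additional "hop" (for example, products bought by users who bought the recommended products) adds further joins on the same table, which is the pain point this slide is getting at.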
WHAT IS THE PROBLEM WITH RDBMS? (PART 2)
“Who are Bob’s friends-of-friends-of-friends?”
“What is the shortest path between two specific friends?”
...?
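For comparison, a traversal only follows the relationships of the nodes it has already reached. The sketch below answers the friends-of-friends-of-friends question with a breadth-first expansion over an in-memory adjacency dictionary. The function name friends_at_depth and the sample people are hypothetical, and this is plain Python rather than how any particular graph database implements traversal.

```python
def friends_at_depth(graph, start, depth):
    """Return the people whose shortest friendship distance from start is exactly depth."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for person in frontier:
            for friend in graph.get(person, ()):
                if friend not in seen:
                    next_frontier.add(friend)
        seen |= next_frontier
        frontier = next_frontier
    return frontier

# Hypothetical, symmetric friendship data
friends = {
    "Bob": ["Alice", "Carl"],
    "Alice": ["Bob", "Dana"],
    "Carl": ["Bob", "Dana"],
    "Dana": ["Alice", "Carl", "Eve"],
    "Eve": ["Dana", "Fred"],
    "Fred": ["Eve"],
}
result = friends_at_depth(friends, "Bob", 3)
print(result)
```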
BASICS: WHAT IS A GRAPH?
• Origin: Euler, 18th century
• It contains nodes and relationships.
• Nodes contain properties (key-value pairs).
• Nodes can be labeled with one or more labels.
• Relationships are named and directed, and always have a start and end node.
• Relationships can also contain properties.
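The model described in these bullets can be written down as two small data structures. This is only an illustrative sketch in Python: the class names Node and Relationship and the sample property values are assumptions, not a real database API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set = field(default_factory=set)        # zero or more labels
    properties: dict = field(default_factory=dict)  # key-value pairs

@dataclass
class Relationship:
    type: str                                       # the relationship's name
    start: Node                                     # always has a start node ...
    end: Node                                       # ... and an end node (directed)
    properties: dict = field(default_factory=dict)  # relationships carry properties too

joe = Node(labels={"Person"}, properties={"name": "Joe"})
bob = Node(labels={"Person"}, properties={"name": "Bob"})
# The "since" value is an invented example property
friend = Relationship("FRIEND", start=joe, end=bob, properties={"since": 1987})
print(friend.start.properties["name"], "-[", friend.type, "]->", friend.end.properties["name"])
```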
GRAPH DATABASES ON THE MARKET
• Non-native storage: data kept in a general-purpose DB
• Native processing: index-free adjacency
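Native graph processing is usually described as index-free adjacency: each node holds direct references to its neighbours, so a hop does not need a lookup in a global structure. The sketch below contrasts the two styles in plain Python. The names GraphNode, connect, neighbours_from_edge_table and neighbours_from_node, as well as the sample data, are hypothetical, and the code only illustrates the idea rather than any product's storage engine.

```python
class GraphNode:
    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct references to adjacent nodes

def connect(a, b):
    """Link two nodes in both directions."""
    a.neighbours.append(b)
    b.neighbours.append(a)

def neighbours_from_edge_table(edge_table, name):
    """Non-native style: every hop searches a global edge table."""
    found = set()
    for start, end in edge_table:
        if start == name:
            found.add(end)
        elif end == name:
            found.add(start)
    return sorted(found)

def neighbours_from_node(node):
    """Native style: every hop just follows the node's own references."""
    return sorted(n.name for n in node.neighbours)

edge_table = [("Joe", "Bob"), ("Bob", "Ann"), ("Ann", "Joe"), ("Ann", "Eve")]
nodes = {name: GraphNode(name) for name in ("Joe", "Bob", "Ann", "Eve")}
for start, end in edge_table:
    connect(nodes[start], nodes[end])

print(neighbours_from_edge_table(edge_table, "Ann"))
print(neighbours_from_node(nodes["Ann"]))
```

Both calls return the same neighbours; the difference is that the cost of the first grows with the size of the whole edge table (or its index), while the second only touches the node's own neighbour list.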
SOCIAL NETWORK SPEED TEST
Data set: 1 000 000 people, each with approximately 50 friends.
[Figure: speed test results]
USE CASES *
• Fraud Detection
• Graph-Based Search
• Identity and Access Management
• Master Data Management
• Network and IT Operations
• Real-Time Recommendations
• Social Network
* Detailed examples from Neo4j
DATA MODELING
RDBMS:
• concept -> logical model -> physical model
• big gap between concept and DB
• structure and data volume determine query speed
• hard to change schema
Neo4j:
• concept directly to DB
• no gap between concept and DB
• query speed not influenced by structure or data volume
• easy to change connections
CYPHER – GRAPH DATABASE QUERY LANGUAGE
[Diagram: a Person node (name: Joe) with a FRIEND relationship pointing to a Person node (name: Bob)]
(:Person {name:"Joe"})-[:FRIEND]->(:Person {name:"Bob"})
• Other query languages: SPARQL, Gremlin ...
• Case sensitive (labels, relationship types and property keys; keywords are case-insensitive)
• Most human-friendly
CREATING SOME TEST DATA IN CYPHER
// creating nodes
create(:Person{name:"Tom Hanks"});
....
// creating relation between two specific nodes
match (a:Person),(b:Movie)
where
a.name='Ron Howard'
and b.title = 'The Da Vinci Code'
create (a)-[r:DIRECTED]->(b) return r;
....
// set relation property
match (:Person{name:"Tom Hanks"})-[n:KNOWS]->(:Person{name:"Ron Howard"})
set n.since=1987;
....
// delete relation (here: a KNOWS relationship from Matt Damon to himself)
match (a)-[r:KNOWS]->(b)
where
a.name='Matt Damon'
and b.name='Matt Damon'
delete r;
QUERYING DATA IN CYPHER
// whom does Tom Hanks know?
match (:Person{name:"Tom Hanks"})-[r:KNOWS]->(b) return b;
// who knows Steven Spielberg?
match (:Person{name:"Steven Spielberg"})<-[:KNOWS]-(b) return b;
// which films has Tom Hanks acted in?
match (:Person{name:"Tom Hanks"})-[:ACTED_IN]->(b) return b;
// delete by id (fails if the node still has relationships)
match (n) where ID(n)=11 delete n;
// get Steven Spielberg's acquaintances 3 levels deep
match (:Person{name:"Steven Spielberg"})
-[:KNOWS]-(b)
-[:KNOWS]-(c)
-[:KNOWS]-(d)
return b, c, d
A BIGGER EXAMPLE
MATCH (tom:Person {name:"Tom Hanks"})
-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors)
RETURN tom, m, coActors
[Figure: Tom Hanks’ co-actors]
FINDING THE SHORTEST PATH
MATCH p=shortestPath(
(kevin:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"})
)
RETURN p
[Figure: the shortest path between Kevin Bacon and Meg Ryan]
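On an unweighted graph, a shortest path like this can be found with a breadth-first search. The sketch below shows the idea in plain Python; it is not Neo4j's implementation of shortestPath. The function name shortest_path and the people and movies in the sample graph are invented for illustration.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk back from the goal to the start
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None

# Hypothetical undirected graph of people and the movies they acted in
graph = {
    "Ann": ["Movie 1"],
    "Movie 1": ["Ann", "Ben"],
    "Ben": ["Movie 1", "Movie 2", "Movie 3"],
    "Movie 2": ["Ben", "Cy"],
    "Cy": ["Movie 2", "Movie 4"],
    "Movie 3": ["Ben", "Dee"],
    "Movie 4": ["Cy", "Dee"],
    "Dee": ["Movie 3", "Movie 4"],
}
path = shortest_path(graph, "Ann", "Dee")
print(path)
```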
RECOMMENDING CO-ACTORS TO TOM HANKS
MATCH
// coActors: acted in the same movies as Tom
// cocoActors: acted with the coActors in other movies (m2),
// in which Tom did not act
(tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
(coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActors)
WHERE NOT (tom)-[:ACTED_IN]->(m2)
RETURN
cocoActors.name AS Recommended,
// strength: how many times the same cocoActor was found
count(*) AS Strength ORDER BY Strength DESC
Find co-actors of co-actors who haven't worked with Tom Hanks (co-co-actors):
[Diagram: Tom -[ACTED_IN]-> m (movie) <-[ACTED_IN]- coActor -[ACTED_IN]-> m2 (movie) <-[ACTED_IN]- cocoActor; no ACTED_IN relationship from Tom to m2]
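The counting logic of the recommendation query can be mirrored over an in-memory mapping of actors to movies. This is a rough sketch, assuming a dictionary acted_in that maps each actor to the set of movies they acted in; the function name recommend_co_co_actors and all sample names are hypothetical. It follows the same steps as the query: Tom's movies, their other cast members, the other movies of those cast members that Tom is not in, and the remaining cast of those movies.

```python
from collections import Counter, defaultdict

def recommend_co_co_actors(acted_in, person):
    """Count how often each co-co-actor is reached, strongest first."""
    cast = defaultdict(set)  # movie -> actors
    for actor, movies in acted_in.items():
        for movie in movies:
            cast[movie].add(actor)

    own_movies = acted_in[person]
    strength = Counter()
    for m in own_movies:
        for co_actor in cast[m] - {person}:
            # Only movies the person did not act in (the WHERE NOT condition)
            for m2 in acted_in[co_actor] - own_movies:
                for co_co_actor in cast[m2] - {co_actor}:
                    strength[co_co_actor] += 1
    # Strongest first; ties broken by name for a stable order
    return sorted(strength.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data
acted_in = {
    "Tom": {"M1", "M2"},
    "Ann": {"M1", "M3", "M4"},
    "Ben": {"M2", "M3"},
    "Cy": {"M3", "M4"},
    "Dee": {"M4"},
}
recommended = recommend_co_co_actors(acted_in, "Tom")
print(recommended)
```

Like the query, this only excludes movies Tom acted in. A direct co-actor (Ann or Ben in the sample data) can therefore still appear in the result when they share another movie with a second co-actor; an extra filter would be needed to drop them.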
NEO4J CLUSTER ARCHITECTURE
• Automatic master election
• Writing to slaves is possible, but writing to the master is faster
• Full replication (data redundancy); graph sharding is under development
• Single server capacity: 34 billion nodes, 34 billion relationships, 65 thousand relationship types and 68 billion properties
• Cluster requires a quorum in order to serve write load
• Reads are served by slaves: reads scale linearly
• Exceptionally high write loads: queuing and vertical scaling
• Large graph that does not fit in RAM: cache sharding by routing queries
• Online backups supported (full and incremental)
• Reporting instances are slaves that will never be elected master
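The quorum requirement can be illustrated with a one-line check. The sketch assumes that "quorum" means a strict majority of the cluster's members are available; the function name has_quorum is hypothetical and the code is not taken from Neo4j.

```python
def has_quorum(available_members, cluster_size):
    """True if a strict majority of the cluster is available (assumed quorum rule)."""
    return available_members > cluster_size // 2

# A 3-member cluster keeps accepting writes with one member down, but not with two down
print(has_quorum(2, 3), has_quorum(1, 3))
```

Under this assumption an even-sized cluster tolerates no more failures than the next smaller odd-sized one: a 4-member cluster still needs 3 available members.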
DEVELOPMENT
Query tuning:
• execution plan
• profiling
Indexing on properties
Accessing:
• web interface
• REST API
• shell
• embedding in Java applications
• Mazerunner extension (Using Apache Spark and Neo4j for Big Data Graph Analytics)
Utilities:
• neo4j-shell
• neo4j-import
• neo4j-backup
• neo4j-arbiter
RESOURCES
Good official manual
From Relational to Graph: A Developer's Guide

Neo4j 20 minutes introduction


Editor's Notes

  • #9 Blue: RDBMS vs Neo4j experience