2. THE REAL VALUE IS IN THE RELATIONSHIPS
• Google: Knowledge Graph
• Facebook: Unicorn
• Twitter: FlockDB
• ....
3. WHAT IS THE PROBLEM WITH RDBMS? (PART 1)
The base question of all recommendation systems:
“User 99 has bought the products 1, 2, 3 and 765 so far. Get the list of other products bought by other users together
with the products 1, 2, 3 or 765 in descending order by popularity”
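The question above can be sketched in a few lines of Python. This is only an in-memory illustration: the `purchases` data for users other than user 99 and the `recommend` helper are made up for the example, and nothing here reflects how a database would execute the query.

```python
from collections import Counter

# Hypothetical purchase history: user id -> set of product ids (made-up data).
purchases = {
    99: {1, 2, 3, 765},
    7: {1, 2, 10, 11},
    8: {3, 10},
    9: {500, 501},          # shares nothing with user 99
    12: {765, 10, 11, 12},
}

def recommend(user_id, purchases):
    """Rank products bought by other users who share a product with user_id."""
    mine = purchases[user_id]
    popularity = Counter()
    for other, basket in purchases.items():
        if other == user_id or not (basket & mine):
            continue  # skip the user and anyone with no product in common
        popularity.update(basket - mine)  # count only products the user lacks
    # Most popular first; ties broken by product id for a stable order.
    return sorted(popularity.items(), key=lambda kv: (-kv[1], kv[0]))

result = recommend(99, purchases)
print(result)
```

In a relational schema the same question typically needs a self-join of the order table plus grouping and sorting, which is the cost this slide points at.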
4. WHAT IS THE PROBLEM WITH RDBMS? (PART 2)
“Who are Bob’s friends-of-friends-of-friends?”
“What is the shortest path between two specific friends?”
...?
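A minimal sketch of the friends-of-friends-of-friends question, assuming the friendships fit in memory as an adjacency map. The names in `friends` and the `people_at_distance` helper are hypothetical; the technique is a plain breadth-first search that stops at the requested depth.

```python
from collections import deque

# Hypothetical undirected friendship graph (made-up names).
friends = {
    "Bob": {"Ann", "Carl"},
    "Ann": {"Bob", "Dora"},
    "Carl": {"Bob", "Dora"},
    "Dora": {"Ann", "Carl", "Eve"},
    "Eve": {"Dora", "Finn"},
    "Finn": {"Eve"},
}

def people_at_distance(graph, start, depth):
    """Return the people whose shortest distance from start is exactly depth."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == depth:
            continue  # do not expand beyond the requested depth
        for friend in graph[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return {p for p, d in distance.items() if d == depth}

print(people_at_distance(friends, "Bob", 3))
```

Each additional level is one more loop iteration here, whereas in SQL each level usually adds another self-join on the friendship table.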
5. BASICS: WHAT IS A GRAPH?
• Origin: Euler, 18th century (the Seven Bridges of Königsberg problem)
• It contains nodes and relationships.
• Nodes contain properties (key-value pairs).
• Nodes can be labeled with one or more labels.
• Relationships are named and directed, and always have a start and end node.
• Relationships can also contain properties.
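The model described by these bullets can be written down as two small Python data classes. This is an illustrative sketch only: the class and field names are assumptions chosen for the example, and the `since` value is made up.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                      # one or more labels, e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # key-value pairs

@dataclass
class Relationship:
    type: str                                        # the relationship's name, e.g. "FRIEND"
    start: Node                                      # relationships are directed:
    end: Node                                        # from the start node to the end node
    properties: dict = field(default_factory=dict)   # relationships can carry properties too

joe = Node({"Person"}, {"name": "Joe"})
bob = Node({"Person"}, {"name": "Bob"})
# "since" is a made-up property value for illustration.
friendship = Relationship("FRIEND", joe, bob, {"since": 2010})

print(friendship.start.properties["name"], "->", friendship.end.properties["name"])
```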
6. GRAPH DATABASES ON THE MARKET
• Non-native storage: data kept in a general-purpose DB
• Native processing: index-free adjacency
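Index-free adjacency means each node holds direct references to its neighbours, so following a relationship does not require a lookup in a global structure. The sketch below contrasts the two ideas in plain Python; the `edge_table`, the `Node` class and both helper functions are hypothetical and only illustrate the principle, not any product's storage engine.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []          # direct references to adjacent nodes

    def connect(self, other):
        self.neighbours.append(other)

# Lookup style: relationships live in one shared table that has to be searched.
edge_table = [("Joe", "Bob"), ("Bob", "Ann"), ("Joe", "Ann")]

def neighbours_via_table(name):
    # Scans every edge; the cost grows with the total number of relationships.
    return [dst for src, dst in edge_table if src == name]

# Index-free style: each node already knows its neighbours.
joe, bob, ann = Node("Joe"), Node("Bob"), Node("Ann")
joe.connect(bob)
bob.connect(ann)
joe.connect(ann)

def neighbours_via_pointers(node):
    # Touches only this node's own neighbour list.
    return [n.name for n in node.neighbours]

print(neighbours_via_table("Joe"), neighbours_via_pointers(joe))
```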
8. USE CASES *
• Fraud Detection
• Graph-Based Search
• Identity and Access Management
• Master Data Management
• Network and IT Operations
• Real-Time Recommendations
• Social Network
* Detailed examples from Neo4j
9. DATA MODELING
RDBMS:
• concept -> logical model -> physical model
• big gap between concept and DB
• structure and data volume determine query speed
• hard to change schema
Graph database:
• concept directly to DB
• no gap between concept and DB
• query speed largely independent of overall structure or data volume
• easy to change connections
10. CYPHER – GRAPH DATABASE QUERY LANGUAGE
Two Person nodes (name: Joe, name: Bob) connected by a FRIEND relationship:
(:Person{name:"Joe"})-[:FRIEND]->(:Person{name:"Bob"})
• Other query languages: SPARQL, Gremlin ...
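• Labels, relationship types and property names are case sensitive; keywords are not
• Designed to be human readable: patterns look like the drawn graph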
11. CREATING SOME TEST DATA IN CYPHER
// creating nodes
create(:Person{name:"Tom Hanks"});
....
// creating relation between two specific nodes
match (a:Person),(b:Movie)
where
a.name='Ron Howard'
and b.title = 'The Da Vinci Code'
create (a)-[r:DIRECTED]->(b) return r;
....
// set relation property
match (:Person{name:"Tom Hanks"})-[n:KNOWS]->
(:Person{name:"Ron Howard"}) set n.since=1987;
....
// delete relation
match (a)-[r:KNOWS]->(b)
where
a.name='Matt Damon'
and b.name='Matt Damon'
delete r;
12. QUERYING DATA IN CYPHER
// whom does Tom Hanks know?
match (:Person{name:"Tom Hanks"})-[r:KNOWS]->(b) return b;
// who knows Steven Spielberg?
match (:Person{name:"Steven Spielberg"})<-[:KNOWS]-(b) return b;
// which films has Tom Hanks acted in?
match (:Person{name:"Tom Hanks"})-[:ACTED_IN]->(b) return b;
// delete a node by id (delete fails while the node still has relationships)
match (n) where ID(n)=11 delete n;
// get Steven Spielberg's acquaintances 3 levels deep
match (:Person{name:"Steven Spielberg"})
-[:KNOWS]-(b)
-[:KNOWS]-(c)
-[:KNOWS]-(d)
return b, c, d
13. A BIGGER EXAMPLE
MATCH (tom:Person {name:"Tom Hanks"})
-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors)
RETURN tom, m, coActors
Tom Hanks’ co-actors:
14. FINDING THE SHORTEST PATH
MATCH p=shortestPath(
(kevin:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"})
)
RETURN p
The shortest path between Kevin Bacon and Meg Ryan:
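For intuition, the same kind of search can be sketched in Python as a breadth-first search over a graph in which people and movies are both nodes. The `casts` sample below is a small hand-picked subset, not the full Neo4j movie dataset, and `build_graph` and `shortest_path` are hypothetical helpers; this does not claim to mirror how `shortestPath` is implemented.

```python
from collections import deque

# Small hand-picked sample of film casts (not the full movie dataset).
casts = {
    "Apollo 13": ["Tom Hanks", "Kevin Bacon"],
    "Sleepless in Seattle": ["Tom Hanks", "Meg Ryan"],
    "A Few Good Men": ["Tom Cruise", "Kevin Bacon"],
}

def build_graph(casts):
    """Undirected adjacency lists where people and movies are both nodes."""
    graph = {}
    for movie, actors in casts.items():
        for actor in actors:
            graph.setdefault(movie, []).append(actor)
            graph.setdefault(actor, []).append(movie)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of one shortest path or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:   # walk back to the start node
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in graph.get(current, []):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None

graph = build_graph(casts)
path = shortest_path(graph, "Kevin Bacon", "Meg Ryan")
print(" - ".join(path))
```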
15. RECOMMENDING CO-ACTORS TO TOM HANKS
MATCH
// coActors: acted in the same movies as Tom
// cocoActors: acted together with the coActors, but only in movies (m2)
// that Tom did not act in
(tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
(coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActors)
WHERE NOT (tom)-[:ACTED_IN]->(m2)
RETURN
cocoActors.name AS Recommended,
// strength: how many times the same cocoActor was found
count(*) AS Strength ORDER BY Strength DESC
Find co-actors who haven't worked with Tom Hanks (co-co-actors):
[Diagram: Tom -ACTED_IN-> m (movie) <-ACTED_IN- coActor -ACTED_IN-> m2 (movie) <-ACTED_IN- cocoActor; Tom has no ACTED_IN relationship to m2]
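The counting logic of this recommendation can be sketched in Python. The `acted_in` sample is a small hand-picked subset rather than the full movie dataset, and `recommend_co_co_actors` is a hypothetical helper that tries to follow the same pattern: one count per (shared movie, other movie, co-co-actor) combination.

```python
from collections import Counter

# Small hand-picked sample: actor -> movies acted in (not the full dataset).
acted_in = {
    "Tom Hanks": {"Apollo 13", "Sleepless in Seattle"},
    "Kevin Bacon": {"Apollo 13", "A Few Good Men"},
    "Meg Ryan": {"Sleepless in Seattle", "Top Gun"},
    "Tom Cruise": {"A Few Good Men", "Top Gun"},
    "Jack Nicholson": {"A Few Good Men"},
}

def recommend_co_co_actors(acted_in, person):
    """Rank actors who share a movie with a co-actor of person, outside person's movies."""
    own_movies = acted_in[person]
    strength = Counter()
    for co_actor, co_movies in acted_in.items():
        shared = co_movies & own_movies              # the movies m
        if co_actor == person or not shared:
            continue                                 # not a co-actor
        for m2 in co_movies - own_movies:            # movies person did not act in
            for candidate, movies in acted_in.items():
                if candidate != co_actor and m2 in movies:
                    strength[candidate] += len(shared)
    # Strongest recommendation first; ties broken by name.
    return sorted(strength.items(), key=lambda kv: (-kv[1], kv[0]))

recommended = recommend_co_co_actors(acted_in, "Tom Hanks")
print(recommended)
```

As in the query, an actor found through several co-actors gets a higher strength.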
16. NEO4J CLUSTER ARCHITECTURE
• Automatic master election
• Possible to write to slaves, but writing to the master is faster
• Full replication (data redundancy); graph sharding is under development
• Single server capacity: 34 billion nodes, 34 billion relationships, 65 thousand relationship types and 68 billion properties
• Cluster requires a quorum in order to serve write load
• Reads are done on slaves: reads scale linearly
• Exceptionally high write loads: queuing and vertical scaling
• Large graph that does not fit in RAM: cache sharding by routing queries
• Online backups supported (full and incremental)
• Reporting instances are slaves that will never be elected to be master
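The quorum requirement can be illustrated with a one-line check. This sketch assumes that "quorum" means a strict majority of the cluster members, which is the usual definition; the function name and parameters are hypothetical and not part of any Neo4j API.

```python
def has_quorum(cluster_size, reachable):
    """True if a strict majority of the cluster members is reachable."""
    return reachable >= cluster_size // 2 + 1

# A 3-member cluster keeps accepting writes with one member down, but not with two.
for reachable in (3, 2, 1):
    print(reachable, has_quorum(3, reachable))
```

Under this assumption an even-sized cluster tolerates no more failures than the next smaller odd-sized one, which is why odd member counts are a common choice.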
17. DEVELOPMENT
Query tuning:
• execution plan
• profiling
Indexing on properties
Accessing:
• web interface
• REST API
• shell
• embedding in Java applications
• Mazerunner extension (Using Apache Spark and Neo4j for Big Data Graph Analytics)
Utilities
• neo4j-shell
• neo4j-import
• neo4j-backup
• neo4j-arbiter