Optimizing Cypher Queries in Neo4j

Wes Freeman (@wefreema)
Mark Needham (@markhneedham)

Today's schedule
• Brief overview of cypher syntax
• Graph global vs Graph local queries
• Labels and indexes
• Optimization patterns
• Profiling cypher queries
• Applying optimization patterns

Cypher Syntax
• Statement parts
o Optional: Querying part (MATCH|WHERE)
o Optional: Updating part (CREATE|MERGE)
o Optional: Returning part (WITH|RETURN)
• Parts can be chained together

Cypher Syntax - Refresher
MATCH (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
RETURN n, r, m
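
As a small sketch of how parts chain together, the statement below pipes one querying part through WITH into a second one. It reuses the placeholder label, relationship type and property from the refresher above; the alias `links` and the threshold are made up for illustration.

```cypher
// Part 1: match, then aggregate per n and hand the result on with WITH
MATCH (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
WITH n, COUNT(m) AS links
// Part 2: filter on the aggregate, then match again from the surviving n
WHERE links > 1
MATCH (n)-[:LINKED]->(other)
RETURN n, links, COLLECT(other) AS others
```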

Starting points
• Graph scan (global; potentially slow)
• Label scan (usually reserved for aggregation
queries; not ideal)
• Label property index lookup (local; good!)

Introducing the football dataset

The 1.9 global scan
O(n)
n = # of nodes
START pl = node(*)
MATCH (pl)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
150ms w/ 30k nodes, 120k rels

The 2.0 global scan
MATCH (pl)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
130ms w/ 30k nodes, 120k rels
O(n)
n = # of nodes

Why is it a global scan?
• Cypher is a pattern matching language
• It doesn't discriminate unless you tell it to
o It must try to start at all nodes to find this pattern, as
specified

Introduce a label
Label your starting points
CREATE (player:Player
{name: "Wayne Rooney"} )

Label scan
O(k)
k = # of nodes with that label
MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
80ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Indexes don't come for free
CREATE INDEX ON :Player(name)
OR
CREATE CONSTRAINT ON (pl:Player)
ASSERT pl.name IS UNIQUE

Index lookup
O(log k)
k = # of nodes with that label
MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
6ms w/ 30k nodes, 120k rels (~900 :Player nodes)
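
If the planner does not pick the index on its own, Cypher 2.0 also accepts an explicit index hint. A minimal sketch, assuming the :Player(name) index from the previous slide has been created:

```cypher
// USING INDEX goes between MATCH and WHERE and names the identifier,
// label and property the lookup should start from
MATCH (pl:Player)-[:played]->(stats)
USING INDEX pl:Player(name)
WHERE pl.name = "Wayne Rooney"
RETURN stats
```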

Optimization Patterns
• Avoid cartesian products
• Avoid patterns in the WHERE clause
• Start MATCH patterns at the lowest
cardinality and expand outward
• Separate MATCH patterns with minimal
expansion at each stage

Introducing the movie data set

Anti-pattern: Cartesian Products
MATCH (m:Movie), (p:Person)

Subtle Cartesian Products
MATCH (p:Person)-[:KNOWS]->(c)
WHERE p.name="Tom Hanks"
WITH c
MATCH (k:Keyword)
RETURN c, k
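
In the query above every contact c is paired with every :Keyword node, so the number of rows is (contacts) x (keywords). One possible way around it, sketched under the assumption that the two result sets really are unrelated and that lists are an acceptable return shape, is to collapse each side to a single row before matching the other:

```cypher
MATCH (p:Person)-[:KNOWS]->(c)
WHERE p.name = "Tom Hanks"
// Aggregating without a grouping key leaves exactly one row
WITH COLLECT(c) AS contacts
MATCH (k:Keyword)
// One row in, one aggregated row out: no contact x keyword pairs
RETURN contacts, COLLECT(k) AS keywords
```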

Counting Cartesian Products
MATCH (pl:Player),(t:Team),(g:Game)
RETURN COUNT(DISTINCT pl),
COUNT(DISTINCT t),
COUNT(DISTINCT g)
80000 ms w/ ~900 players, ~40 teams, ~1200 games
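
A rough estimate of why this is slow, using the approximate counts quoted above: the three unconnected patterns produce about 900 x 40 x 1,200 = 43,200,000 rows before any DISTINCT is applied. The sketch below counts those combinations directly (the alias `combinations` is illustrative); expect it to be about as slow as the query above.

```cypher
// The patterns share no identifiers, so every player is paired with
// every team and every game before aggregation starts
MATCH (pl:Player),(t:Team),(g:Game)
RETURN COUNT(*) AS combinations
```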

Better Counting
MATCH (pl:Player)
WITH COUNT(pl) as players
MATCH (t:Team)
WITH COUNT(t) as teams, players
MATCH (g:Game)
RETURN COUNT(g) as games, teams, players
8ms w/ ~900 players, ~40 teams, ~1200 games

Directions on patterns
MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = "Tom Hanks"
RETURN m
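
The pattern above leaves the relationship direction open. Assuming the movie dataset only ever points ACTED_IN from a person to a movie (an assumption about the data model, not something the slides state), the direction can be spelled out so the matcher does not have to consider both:

```cypher
// Same query with an explicit direction on the relationship
MATCH (p:Person)-[:ACTED_IN]->(m)
WHERE p.name = "Tom Hanks"
RETURN m
```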

Parameterize your queries
MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = {name}
RETURN m

Fast predicates first
Bad:
MATCH (t:Team)-[:played_in]->(g)
WHERE NOT (t)-[:home_team]->(g)
AND g.away_goals > g.home_goals
RETURN t, COUNT(g)

Better:
MATCH (t:Team)-[:played_in]->(g)
WHERE g.away_goals > g.home_goals
AND NOT (t)-[:home_team]->(g)
RETURN t, COUNT(g)

Patterns in WHERE clauses
• Keep them in the MATCH
• The only pattern that needs to be in a
WHERE clause is a NOT
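
A sketch of what this looks like on the football dataset, shown as three separate statements. It assumes at most one played_in relationship between a given team and game, so that the first two return the same counts.

```cypher
// Pattern used as a predicate: t and g are matched independently,
// then each pair is filtered by the pattern
MATCH (t:Team), (g:Game)
WHERE (t)-[:played_in]->(g)
RETURN t, COUNT(g)

// Same pattern moved into the MATCH
MATCH (t:Team)-[:played_in]->(g:Game)
RETURN t, COUNT(g)

// A negated pattern has no MATCH equivalent, so it stays in WHERE
MATCH (t:Team)-[:played_in]->(g:Game)
WHERE NOT (t)-[:home_team]->(g)
RETURN t, COUNT(g)
```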

MERGE and CONSTRAINTs
• MERGE is MATCH or CREATE
• MERGE can take advantage of unique
constraints and indexes
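
A minimal sketch of the combination, assuming match_id uniquely identifies a game. The two statements must be run separately, since schema changes and data changes cannot share a transaction.

```cypher
// Run once: a unique constraint also creates an index on :Game(match_id)
CREATE CONSTRAINT ON (g:Game) ASSERT g.match_id IS UNIQUE

// Run per game: MERGE on the constrained property only, and set the
// remaining properties when the node is first created
MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.home_goals = 2, g.away_goals = 3
RETURN g
```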

MERGE (without index)
MERGE (g:Game
{date:1290257100,
time: 1245,
home_goals: 2,
away_goals: 3,
match_id: 292846,
attendance: 60102})
RETURN g
188 ms w/ ~400 games

Adding an index
CREATE INDEX ON :Game(match_id)

MERGE (with index)
MERGE (g:Game
{date:1290257100,
time: 1245,
home_goals: 2,
away_goals: 3,
match_id: 292846,
attendance: 60102})
RETURN g
6 ms w/ ~400 games

Alternative MERGE approach
MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.date = 1290257100,
              g.time = 1245,
              g.home_goals = 2,
              g.away_goals = 3,
              g.attendance = 60102
RETURN g

Profiling queries
• Use the PROFILE keyword in front of the
query
o from webadmin or shell - won't work in browser
• Look for db_hits and rows
• Ignore everything else (for now!)
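
A minimal sketch of a profiled query, reusing the player lookup from the football dataset. The plan printed after the results lists one operator per line with its _rows and _db_hits.

```cypher
// PROFILE runs the query and then prints the execution plan
PROFILE
MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
```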

Reviewing the football dataset

Football Optimization
MATCH (game)<-[:contains_match]-(season:Season),
(team)<-[:away_team]-(game),
(stats)-[:in]->(game),
(team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
COLLECT(DISTINCT team.name),
SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10
3137 ms w/ ~900 players, ~20 teams, ~400 games

Football Optimization
==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a", "
INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323"], returnItemNames=["player.name", "COLLECT(DISTINCT
team.name)", "goals"], _rows=10, _db_hits=0)
==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323 of type Number),false)"],
limit="Literal(10)", _rows=10, _db_hits=0)
==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-
e746407c504a,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-
8ca9-fb0286f3d323,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)
==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED108", "season", " UNNAMED55", "player", "team", " UNNAMED124", "
UNNAMED85", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)
==> PatternMatch(g="(player)-[' UNNAMED124']-(stats)", _rows=5192, _db_hits=0)
==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=5192, _db_hits=15542)
==> TraversalMatcher(trail="(season)-[ UNNAMED12:contains_match WHERE true AND true]->(game)<-[ UNNAMED85:in WHERE
true AND true]-(stats)-[ UNNAMED108:for WHERE true AND true]->(team)<-[ UNNAMED55:away_team WHERE true AND true]-
(game)", _rows=15542, _db_hits=1620462)

Break out the match statements
MATCH (game)<-[:contains_match]-(season:Season)
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
COLLECT(DISTINCT team.name),
SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

Start small
• Smallest cardinality label first
• Smallest intermediate result set first

Exploring cardinalities
MATCH (game)<-[:contains_match]-(season:Season)
RETURN COUNT(DISTINCT game), COUNT(DISTINCT season)
1140 games, 3 seasons
MATCH (team)<-[:away_team]-(game:Game)
RETURN COUNT(DISTINCT team), COUNT(DISTINCT game)
25 teams, 1140 games

Exploring cardinalities
MATCH (stats)-[:in]->(game:Game)
RETURN COUNT(DISTINCT stats), COUNT(DISTINCT game)
31117 stats, 1140 games
MATCH (stats)<-[:played]-(player:Player)
RETURN COUNT(DISTINCT stats), COUNT(DISTINCT player)
31117 stats, 880 players

Look for teams first
MATCH (team)<-[:away_team]-(game:Game)
MATCH (game)<-[:contains_match]-(season)
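MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
COLLECT(DISTINCT team.name),
SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10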

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969", "
INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5"], returnItemNames=["player.name", "COLLECT(DISTINCT
team.name)", "goals"], _rows=10, _db_hits=0)
==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5 of type Number),false)"],
limit="Literal(10)", _rows=10, _db_hits=0)
==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-
b0c7ec85c969,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE199af213-e3bd-400f-
aba9-8ca2a9e153c5,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)
==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED168", "season", " UNNAMED125", "player", "team", " UNNAMED152", "
UNNAMED51", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)
==> PatternMatch(g="(stats)-[' UNNAMED152']-(team),(player)-[' UNNAMED168']-(stats)", _rows=5192, _db_hits=0)
==> PatternMatch(g="(stats)-[' UNNAMED125']-(game)", _rows=10394, _db_hits=0)
==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=380, _db_hits=380)
==> PatternMatch(g="(season)-[' UNNAMED51']-(game)", _rows=380, _db_hits=1140)
==> TraversalMatcher(trail="(game)-[ UNNAMED12:away_team WHERE true AND true]->(team)", _rows=1140,
_db_hits=1140)

Filter games a bit earlier
RETURN player.name,
ORDER BY goals DESC

Filter out stats with no goals
MATCH (stats)-[:in]->(game)
WHERE stats.goals > 0
RETURN player.name,
ORDER BY goals DESC
LIMIT 10
59 ms w/ ~900 players, ~20 teams, ~400 games

Movie query optimization
MATCH (movie:Movie {title: {title} })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight,
count(DISTINCT actormovies) as actormoviesweight, movie, collect(DISTINCT genre.name) as
genres, collect(DISTINCT director.name) as directors, actor, collect(DISTINCT writer.name)
as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie,
collect(DISTINCT {related: {title: related.title}, weight: weight}) as related, genres,
directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related, actors,
genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers

MATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)
WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight // 1 row per actor
ORDER BY actormoviesweight DESC
WITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, actors, collect(genre) as genres // 1 row
MATCH (director)-[:DIRECTED]->(movie)
WITH movie, actors, genres, collect(director.name) as directors // 1 row
MATCH (writer)-[:WRITER_OF]->(movie)
WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres,
directors, actors, writers // 1 row per related movie
ORDER BY keywords DESC
WITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related,
movie, actors, genres, directors, writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)
RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers // 1 row
10x faster

Making the implicit explicit
• When you have implicit relationships in the
graph you can sometimes get better query
performance by modeling the relationship
explicitly

Refactor property to node
Bad:
MATCH (g:Game)
WHERE
g.date > 1343779200
AND g.date < 1369094400
RETURN g

Good:
MATCH (s:Season)-[:contains]->(g)
WHERE s.name = "2012-2013"
RETURN g
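
A sketch of the one-off refactoring that would make this possible: it links every game in the date range to a season node. The season name "2012-2013" is taken from the earlier football query and assumed to correspond to these timestamps; the statement also assumes a Neo4j version in which MERGE supports relationships.

```cypher
// Find the games by date once, then record the season explicitly
MATCH (g:Game)
WHERE g.date > 1343779200
  AND g.date < 1369094400
// MERGE reuses the season node and relationship if they already exist
MERGE (s:Season { name: "2012-2013" })
MERGE (s)-[:contains]->(g)
```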

Conclusion
• Avoid the global scan
• Add indexes / unique constraints
• Split up MATCH statements
• Measure, measure, measure, tweak, repeat
• Soon Cypher will do a lot of this for you!

Bonus tip
• Use transactions/transactional cypher
endpoint

Q & A
• If you have questions, send them in
