© 2022 Neo4j, Inc. All rights reserved.
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Martin Junghanns
Senior Software Engineer
Neo4j - Product Engineering
Outline
• The Neo4j Graph Data Science (GDS) Platform
◦ Overview
◦ A generic GDS workflow
• Scalability Challenges
• Neo4j GDS under the microscope
◦ Graph Projection
◦ “Huge” Data Structures
◦ Algorithm Execution
◦ Arrow Data Import and Export
• Summary and Outlook
Before we start …
This is not an introduction to Neo4j Graph Data Science or Data Science in
general.
See the talks of my colleagues for more information:
Alicia: “Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science”
Zach: “Connecting Neo4j Graph Data Science into Your Data Ecosystem and Workflows”
Luke: “New! Neo4j AuraDS: The Fastest Way to Get Started”
Dave: “Achieve Blazing-Fast Ingest Speeds with Apache Arrow”
Neo4j Graph Data Science
Overview
• Main features
◦ in-memory property graph optimized for executing algorithms
◦ graph catalog for managing multiple named in-memory graphs
◦ vast collection of graph and machine learning algorithms
◦ accessible via Cypher procedures and GDS Python Client
• Community and Enterprise versions
◦ Enterprise contains improved data structures for graph representation
◦ Enterprise has additional features (e.g. Apache Arrow, Clustering support, …)
◦ Community is entirely open source; Enterprise code is closed source
A generic GDS workflow
1 Graph Projection
2 Algorithm Execution
3 Data Export

Via Cypher procedures:
1 Graph Projection
    CALL gds.graph.project
2 Algorithm Execution
    CALL gds.wcc.mutate
    CALL gds.pageRank.mutate
    CALL gds.louvain.mutate
    ...
3 Data Export
    CALL gds.graph.[stream,write]NodeProperties
    CALL gds.graph.[stream,write]RelationshipProperties
    ...

Via Apache Arrow Flight (in GDS 2.1):
1 Graph Projection
    doACTION(CREATE_GRAPH, ...)
    doPUT(NODE_STREAM)
    doPUT(RELATIONSHIP_STREAM)
3 Data Export
    doGET(NODE_PROPERTIES)
    doGET(RELATIONSHIP_PROPERTIES)
    ...
Scalability Challenges
Scalability challenges - Customer facing
• Data size
◦ Customers with multiple billion nodes and relationships
◦ Not only topology, but also node and relationship properties
◦ Algorithm results often need to be stored in-memory
◦ Requires a compact in-memory graph representation
• Data import
◦ Customers want their graphs projected within minutes / hours, not days
◦ Requires parallel, non-blocking graph construction
◦ Customers may want their data to be projected from sources other than Neo4j
◦ Requires parallel, structured, compressed data streaming
Scalability challenges - Customer facing
• Algorithm execution
◦ Customers want their results computed as fast as possible (within minutes)
◦ Customer graphs vary in their topology (power-law vs uniform distributions)
◦ Requires parallel, topology-aware algorithm execution
• Data export
◦ Customers often export algorithm results for downstream analysis
◦ Requires parallel, structured, compressed data streaming
Scalability challenges - Developer facing
• Java is a memory-managed, garbage-collected language
◦ Garbage Collector / Object overhead becomes non-negligible at scale
◦ Requires usage of primitive data types instead of objects
• Java has no generic data structures for handling primitive data types
◦ List<int> is not supported in Java
◦ long[] is limited to ~2.1 bn entries
◦ Requires custom data structure implementations
Neo4j GDS under the microscope
Huge Data Structures
Huge Data Structures
• “Java has no generic data structures for handling primitive data types”
◦ int (4 Byte) vs. Integer (16 Byte), long (8 Byte) vs. Long (24 Byte)
◦ List<long> vs. List<Long> vs. long[]
• GDS uses High Performance Primitive Collections (hppc): https://github.com/carrotsearch/hppc
◦ non-thread-safe, primitive lists, sets and maps
• GDS provides “huge” collections for more than 2 bn elements
◦ Huge[Byte,Int,Long,Float,Double]Array
◦ HugeSparse[Byte,Int,Long,Float,Double]Array
◦ HugeSparse[Byte,Int,Long,Float,Double]ArrayList
◦ HugeAtomic[Byte,Int,Long,Float,Double]Array
◦ implemented to achieve performance equivalent to primitive arrays
Huge Data Structures

HugeLongArray (each page is a long[])
index   0    1    2    3     4     5     6     7     ..   n
value   34   55   89   144   233   377   610   987   ..   42

HugeSparseLongArray (each page is a long[]; pages without values are null)
index   0    1    2    3     4 - 7   ..   n
value   34   55   89   144   null    ..   42
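The paging idea behind the “huge” arrays can be sketched as follows. This is a minimal illustration, not the GDS implementation: the class name `PagedLongArraySketch` and the page size of 2^14 entries are assumptions made for the example. A long index is split into a page index (high bits) and an index within the page (low bits), so the total size is no longer bound by the `int` limit of a single `long[]`.

```java
// Hypothetical sketch of a paged long array; names and page size are assumptions.
final class PagedLongArraySketch {
    private static final int PAGE_SHIFT = 14;              // 2^14 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_SHIFT;
    private static final long PAGE_MASK = PAGE_SIZE - 1;

    private final long size;
    private final long[][] pages;

    PagedLongArraySketch(long size) {
        this.size = size;
        int pageCount = (int) ((size + PAGE_MASK) >>> PAGE_SHIFT);
        this.pages = new long[pageCount][];
        for (int i = 0; i < pageCount; i++) {
            // the last page only holds the remaining entries
            long remaining = size - ((long) i << PAGE_SHIFT);
            pages[i] = new long[(int) Math.min(PAGE_SIZE, remaining)];
        }
    }

    long get(long index) {
        return pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)];
    }

    void set(long index, long value) {
        pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)] = value;
    }

    long size() {
        return size;
    }

    public static void main(String[] args) {
        PagedLongArraySketch array = new PagedLongArraySketch(100_000L);
        array.set(70_000L, 42L);
        System.out.println(array.get(70_000L));
    }
}
```

A sparse variant could leave `pages[i]` as `null` until the first write to that page, which is what the `null` page in the figure suggests.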
Neo4j GDS under the microscope
Graph Projection
The in-memory graph
● “Customers with multiple billion nodes and relationships”
○ Requires a compact in-memory graph representation
● The in-memory graph consists of a node projection and a relationship projection
The in-memory graph - Nodes
● Nodes := Id Mapping + Node Labels + Node Properties

Id Mapping
id   origId
0    42
1    1337
2    1984
..
n    0

Node Labels (one bit per label)
id
0    1 0 0 1
1    0 1 0 0
2    0 1 0 1
..
n    1 0 0 0

Node Properties
id   foo     bar
0    73301   [13, 37]
1    78717   [98]
2    78717   [1, 3, 4]
..
n    78751   [12, 13]
The in-memory graph - Nodes - Id Mapping
● Node id space in the original data is often sparse
○ Neo4j: store fragmentation and/or subgraph projection
○ Arrow: customer data contains arbitrary 64 Bit long ids
● GDS uses a consecutive node id space internally: [0..nodeCount)
○ favors use of array-based data structures
○ simplifies algorithm implementation: for (long id = 0; id < nc; id++)
○ favors relationship compression
● Id Mapping has two main methods
○ toInternalId(long originalId) -> long
○ toOriginalId(long internalId) -> long
● Different implementations for Community and Enterprise
The in-memory graph - Nodes - Id Mapping
● Community Edition - ArrayIdMap

id   origId
0    42
1    1337
2    1984
...
n    ..

class ArrayIdMap {
    HugeLongArray originalIds;       // maps from internal id to original id
    HugeSparseLongArray internalIds; // maps from original id to internal id

    long toOriginalId(long internalId) { return originalIds.get(internalId); }
    long toInternalId(long originalId) { return internalIds.get(originalId); }
}

originalIds
index   0    1      2      3      4      5    6    7      ..   n
value   42   1337   1984   1985   1986   43   41   1987   ..   ..

internalIds
index   0   1    2    3   4 - 7   ..   40   41   42   43   ..   m
value   8   -1   -1   9   null    ..   -1   6    0    5    ..   7

That’s 2*8 = 16 Byte for a single id-to-id mapping.
The in-memory graph - Nodes - Id Mapping
● Enterprise Edition - BitIdMap

id   origId
0    42
1    1337
2    1984
...
n    ..

class BitIdMap {
    LongLongBitMap map; // maps between both id spaces

    long toOriginalId(long internalId) { return map.getKey(internalId); }
    long toInternalId(long originalId) { return map.getValue(originalId); }
}

pages (long[])
page   original ids   value
0      [0,63]         0x40000000000
..
20     [1280,1343]    0x200000000000000
..
31     [1984,2047]    0x1
..

pages[0] = 00000000 00000000 00000100 00000000 00000000 00000000 00000000 00000000

Each page represents 64 original ids in a single long value.
That’s 1 Bit for a single id-to-id mapping.
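One way such a bit-based mapping can work is sketched below. This is an illustration under assumptions, not the Enterprise implementation: the class `BitIdMapSketch`, its constructor and the per-page offset array are hypothetical. The sketch assigns internal ids in ascending order of original id, so the internal id of an original id is the number of set bits before it (a rank query), and the reverse lookup finds the k-th set bit (a select query).

```java
// Hypothetical sketch: one bit per original id, internal id = rank of that bit.
final class BitIdMapSketch {
    private final long[] pages;       // bit i of pages[p] is set <=> original id p * 64 + i is mapped
    private final long[] pageOffsets; // number of set bits in all pages before page p (assumption)
    private final long nodeCount;

    BitIdMapSketch(long[] originalIds, long maxOriginalId) {
        pages = new long[(int) (maxOriginalId >>> 6) + 1];
        for (long id : originalIds) {
            pages[(int) (id >>> 6)] |= 1L << (id & 63);
        }
        pageOffsets = new long[pages.length];
        long count = 0;
        for (int p = 0; p < pages.length; p++) {
            pageOffsets[p] = count;
            count += Long.bitCount(pages[p]);
        }
        nodeCount = count;
    }

    // rank: count the set bits below the bit of originalId; -1 if the id is not mapped
    long toInternalId(long originalId) {
        int page = (int) (originalId >>> 6);
        long bit = 1L << (originalId & 63);
        if (page >= pages.length || (pages[page] & bit) == 0) {
            return -1;
        }
        return pageOffsets[page] + Long.bitCount(pages[page] & (bit - 1));
    }

    // select: find the page containing the internalId-th set bit, then the bit within the page
    long toOriginalId(long internalId) {
        if (internalId < 0 || internalId >= nodeCount) {
            return -1;
        }
        int lo = 0;
        int hi = pages.length - 1;
        while (lo < hi) { // last page whose offset is <= internalId
            int mid = (lo + hi + 1) >>> 1;
            if (pageOffsets[mid] <= internalId) lo = mid; else hi = mid - 1;
        }
        long word = pages[lo];
        for (long skip = internalId - pageOffsets[lo]; skip > 0; skip--) {
            word &= word - 1; // clear the lowest set bit
        }
        return ((long) lo << 6) + Long.numberOfTrailingZeros(word);
    }

    public static void main(String[] args) {
        BitIdMapSketch map = new BitIdMapSketch(new long[] {42, 1337, 1984}, 2047);
        System.out.println(map.toInternalId(1337) + " " + map.toOriginalId(2));
    }
}
```

The sketch keeps one offset per page, which doubles the memory per page; grouping several pages into a block with a shared offset would bring the overhead down.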
The in-memory graph - Nodes - Id Mapping

node count   max original node id   ArrayIdMap                     BitIdMap              Reduction per Mapping
1 bn         100 bn                 [15 GiB ... 752 GiB]           12 GiB                ~ 32 x
                                    [~16 … 807] Byte / Mapping     ~12 Byte / Mapping
100 bn       100 bn                 1490 GiB                       12 GiB                ~ 128 x
                                    ~16 Byte / Mapping             ~1.03 Bit / Mapping

Community Edition
ArrayIdMap: [15356 MiB ... 752 GiB]
|-- this.instance: 40 Bytes
|-- original identifiers: 7630 MiB
|-- internal identifiers: [7726 MiB ... 745 GiB]
|-- Node Label BitSets: 0 Bytes

Enterprise Edition
BitIdMap: 12386 MiB
|-- this.instance: 32 Bytes
|-- Identifier mapping: 12386 MiB
    |-- this.instance: 48 Bytes
    |-- pages: 11920 MiB
    |-- block offsets: 186 MiB
    |-- sorted block offsets: 186 MiB
    |-- block mapping: 93 MiB
|-- Node Label BitSets: 0 Bytes
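The dominant terms of these estimates can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming 2 * 8 Byte per mapping for the array-based map and 1 Bit per original id for the bit pages; the class and method names are made up for the example, and the smaller bookkeeping structures are ignored.

```java
// Hypothetical estimator for the dominant memory terms of both id maps.
final class IdMapMemorySketch {
    static final double MIB = 1024.0 * 1024.0;
    static final double GIB = MIB * 1024.0;

    // array-based map, dense case: one long per direction for every mapped node
    static long arrayIdMapBytes(long nodeCount) {
        return nodeCount * 2 * Long.BYTES;
    }

    // bit-based map: one bit for every id in the original id space
    static long bitIdMapPageBytes(long maxOriginalId) {
        return (maxOriginalId + 7) / 8;
    }

    public static void main(String[] args) {
        long hundredBillion = 100_000_000_000L;
        System.out.printf("ArrayIdMap: %.0f GiB%n", arrayIdMapBytes(hundredBillion) / GIB);
        System.out.printf("BitIdMap pages: %.0f MiB%n", bitIdMapPageBytes(hundredBillion) / MIB);
    }
}
```

For 100 bn nodes this gives 1.6 * 10^12 Byte (about 1490 GiB) for the array-based map and 1.25 * 10^10 Byte (about 11920 MiB) for the bit pages, which matches the figures in the table and the memory tree.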
The in-memory graph - Nodes - Node Labels
• BitSet per label
◦ constant-time lookup; efficient for label filters (union)
◦ memory consumption depends on node count: ~ nodeCount Bit
◦ 1 bn nodes ~ 120 MiB
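A label store of this shape can be sketched with `java.util.BitSet`. The class `NodeLabelsSketch` and its methods are hypothetical; `java.util.BitSet` is limited to `int` indices, so a store for billions of nodes would need a paged bit set instead.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one BitSet per label, bit i is set if node i carries the label.
final class NodeLabelsSketch {
    private final Map<String, BitSet> labels = new HashMap<>();
    private final int nodeCount;

    NodeLabelsSketch(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    void addLabel(int nodeId, String label) {
        labels.computeIfAbsent(label, l -> new BitSet(nodeCount)).set(nodeId);
    }

    // constant-time lookup of a single label
    boolean hasLabel(int nodeId, String label) {
        BitSet bits = labels.get(label);
        return bits != null && bits.get(nodeId);
    }

    // label filter: all nodes that carry at least one of the given labels
    BitSet unionOf(String... wanted) {
        BitSet result = new BitSet(nodeCount);
        for (String label : wanted) {
            BitSet bits = labels.get(label);
            if (bits != null) {
                result.or(bits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NodeLabelsSketch store = new NodeLabelsSketch(4);
        store.addLabel(0, "A");
        store.addLabel(1, "B");
        store.addLabel(2, "B");
        System.out.println(store.unionOf("A", "B"));
    }
}
```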
The in-memory graph - Nodes - Node Properties
• HugeSparse*Array per property
◦ constant-time lookup
◦ memory consumption depends on node count: ~ nodeCount * sizeOf(type)
◦ 1 bn long properties ~ 7634 MiB
The in-memory graph - Relationships
● Relationships := AdjacencyList + Properties

source   target 0   target 1   target 2   target 3
1337     2001       1984       2005
  foo    2.1        1.5        4.3
2003     3456       3442       1992       2005
  foo    2.5        24.6       4.2        5.7

source   target 0   target 1   target 2
1234     3323       3323       4346
  bar    4.2        2.1        8.4
The in-memory graph - Relationships
● Accessing the adjacency of a node is the most important operation
○ adjacency must be accessible in constant time
○ adjacency must be stored “compact” in memory for cache efficiency
○ adjacency must be optimized for graph traversal
● Real-world graphs are sparse
○ each node is only connected to a subset of other nodes
○ relationship distributions vary depending on the use case
● Adjacency List (AL) has two main methods (main API for algorithm implementers)
○ AL::forEachRelationship(long nodeId, (src, tgt) -> {})
○ AL::forEachRelationship(long nodeId, (src, tgt, prop) -> {})
● Same implementation for Community and Enterprise
The in-memory graph - Relationships - CSR
• Adjacency List implementation: Compressed Sparse Row Variant (https://en.wikipedia.org/wiki/Sparse_matrix)

degrees (HugeIntArray)
node id   0   ...   1337   ...   2003   ...   n
degree    0   ...   3      ...   4      ...   ...

offsets (HugeLongArray)
node id   0   ...   1337   ...   2003   ...   n
offset    0   ...   4096   ...   5432   ...   ...

targets (long[][])
offset    0   ...   4096 - 4098      ...   5432 - 5435
target    0   ...   1984 2001 2005   ...   1992 2005 3442 3456   ...

Storing targets uncompressed consumes relationship count * 8 Byte: 100 bn rels ~ 745 GiB
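The three arrays can be built and traversed as in the following sketch. It is a simplified, single-threaded illustration with plain Java arrays; the class `CsrSketch`, its constructor and the `RelationshipConsumer` interface are assumptions for the example and do not mirror the GDS classes.

```java
import java.util.Arrays;

// Hypothetical CSR sketch: degrees + offsets per node, all targets in one flat array.
final class CsrSketch {
    interface RelationshipConsumer {
        void accept(long source, long target);
    }

    private final int[] degrees;
    private final long[] offsets;
    private final long[] targets;

    // relationships: array of {source, target} pairs over node ids [0, nodeCount)
    CsrSketch(int nodeCount, long[][] relationships) {
        degrees = new int[nodeCount];
        for (long[] rel : relationships) {
            degrees[(int) rel[0]]++;
        }
        offsets = new long[nodeCount];
        long offset = 0;
        for (int node = 0; node < nodeCount; node++) { // prefix sum over the degrees
            offsets[node] = offset;
            offset += degrees[node];
        }
        targets = new long[(int) offset];
        int[] written = new int[nodeCount];
        for (long[] rel : relationships) {
            int source = (int) rel[0];
            targets[(int) offsets[source] + written[source]++] = rel[1];
        }
        for (int node = 0; node < nodeCount; node++) { // sorted target lists
            int from = (int) offsets[node];
            Arrays.sort(targets, from, from + degrees[node]);
        }
    }

    int degree(long nodeId) {
        return degrees[(int) nodeId];
    }

    // constant-time access to the adjacency, then a sequential scan over the targets
    void forEachRelationship(long nodeId, RelationshipConsumer consumer) {
        int from = (int) offsets[(int) nodeId];
        int to = from + degrees[(int) nodeId];
        for (int i = from; i < to; i++) {
            consumer.accept(nodeId, targets[i]);
        }
    }

    public static void main(String[] args) {
        CsrSketch csr = new CsrSketch(4, new long[][] {{0, 3}, {0, 1}, {2, 0}, {0, 2}});
        csr.forEachRelationship(0, (s, t) -> System.out.println(s + " -> " + t));
    }
}
```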
The in-memory graph - Relationships - CSR
• Adjacency List implementation: Compressed Sparse Row Variant
◦ Target compression is applied during graph projection

step                       source node 1337             source node 2003
initial target lists       2001 1984 2005 (24 Byte)     3456 3442 1992 2005 (32 Byte)
sorting                    1984 2001 2005               1992 2005 3442 3456
delta encoding             1984 17 4                    1992 13 1437 14
variable-length encoding   0x40 0x8F 0x91 0x84          0x48 0x8F 0x8D 0x1D 0x8B 0x8E
                           (4 Byte)                     (6 Byte)
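The three steps (sorting, delta encoding, variable-length encoding) can be sketched as below. The byte layout is inferred from the example bytes on the slide, not taken from the GDS source: each byte carries 7 bits of payload, low-order group first, and the high bit marks the last byte of a value. The class `DeltaVarLongSketch` is hypothetical and assumes non-negative target ids.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of sort + delta + variable-length encoding of a target list.
final class DeltaVarLongSketch {
    static byte[] compress(long[] targets) {
        long[] sorted = targets.clone();
        Arrays.sort(sorted);                       // 1. sorting
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long target : sorted) {
            long delta = target - previous;        // 2. delta encoding
            previous = target;
            while (delta > 0x7F) {                 // 3. variable-length encoding
                out.write((int) (delta & 0x7F));   // 7 payload bits, more bytes follow
                delta >>>= 7;
            }
            out.write((int) (delta | 0x80));       // high bit marks the last byte of a value
        }
        return out.toByteArray();
    }

    static long[] decompress(byte[] bytes, int degree) {
        long[] targets = new long[degree];
        long previous = 0;
        long value = 0;
        int shift = 0;
        int index = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {                 // value complete: undo the delta
                previous += value;
                targets[index++] = previous;
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        byte[] compressed = compress(new long[] {2001, 1984, 2005});
        System.out.println(compressed.length + " Byte: " + Arrays.toString(decompress(compressed, 3)));
    }
}
```

Applied to the two example lists, this yields 4 and 6 bytes instead of 24 and 32. Small deltas fit into a single byte, which is why the consecutive internal id space favors relationship compression.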
The in-memory graph - Relationships - CSR
• Adjacency List implementation: Compressed Sparse Row Variant

degrees (HugeIntArray)
node id   0   ...   1337   ...   2003   ...   n
degree    0   ...   3      ...   4      ...   ...

offsets (HugeLongArray)
node id   0   ...   1337   ...   2003   ...   n
offset    0   ...   2100   ...   3205   ...   ...

targets (byte[][])
offset    0   ..   2100 - 2103           ..   3205 - 3210                     ...
byte      0   ..   0x40 0x8F 0x91 0x84   ..   0x48 0x8F 0x8D 0x1D 0x8B 0x8E   ...
The in-memory graph - Relationships - Properties
• Relationship properties are also stored in CSR representation
◦ in the exact same order as the corresponding target node ids
◦ allows lock-step iteration of target ids and relationship values

targets (byte[][])
offset    0   ..   2100 - 2103           ..   3205 - 3210                     ..
byte      0   ..   0x40 0x8F 0x91 0x84   ..   0x48 0x8F 0x8D 0x1D 0x8B 0x8E   ..
decoded            1984 2001 2005             1992 2005 3442 3456

properties (long[][])
offset    0   ..   6132 - 6134           ..   8192 - 8195                     ..
value     0   ..   1.5 2.1 4.3           ..   4.2 5.7 24.6 2.5                ..
The in-memory graph - Relationships - Memory

node count   relationship count   uncompressed values         compressed values
1 bn         100 bn               ~ 756 GiB                   [106 GiB ... 382 GiB]
                                  ~8.2 Byte / Value           ~1.4 Byte / Value
                                  (used for rel properties)   (default for topology)

UncompressedAdjacencyList: 756 GiB
|-- this.instance: 24 Bytes
|-- pages: 745 GiB
|-- degrees: 3815 MiB
|-- offsets: 7630 MiB

CompressedAdjacencyList: [106 GiB ... 382 GiB]
|-- this.instance: 24 Bytes
|-- pages: [95 GiB ... 371 GiB]
|-- degrees: 3815 MiB
|-- offsets: 7630 MiB
The in-memory graph - Native Graph Projection
● Nodes and relationships are read via sequential scans of record stores
○ scan is IO-friendly since we touch a page only once (no random reads)
○ allows us to use a small page cache (~ page size * 100 * concurrency)
○ scales nearly linearly with the number of threads
● Nodes are read from record store
○ can also make use of node label index and property index if part of projection
○ threads read chunks of node records and forward them to thread-safe Id Mapping-, Label- and Property-Builders
● Relationships are read from relationship record store
○ threads read chunks of relationship records, group them by source or target, pre-compress and store partial adjacency lists in an intermediate list
○ multi-threaded compression of partially compressed adjacency lists into CSR
Native Graph Projection - Scale Testing
• Imported Graph500* artificial data sets into Neo4j
◦ power-law relationship distribution, no properties
• Azure M416ms, 416 vCPU, 11 TB RAM
◦ Neo4j with ~9 TB Heap and ~1 TB PageCache
◦ gds.graph.project with readConcurrency: 416

scale   node count (2^scale)   relationship count (2^(scale + 4))   Neo4j disk size [GB]
30      1,073,741,824          17,179,869,184                       1113
31      2,147,483,648          34,359,738,368                       2226
32      4,294,967,296          68,719,476,736                       4452
33      8,589,934,592          137,438,953,472                      9272

*Graph 500 Benchmark specification https://graph500.org/
Known problem: GDS 2.2 (fall ‘22) will improve projection performance for large-scale graphs.
Neo4j GDS under the microscope
Algorithm Execution
Algorithm Execution - Compute template
• “Customers want their results computed as fast as possible.”
◦ Requires parallel, topology-aware algorithm execution
◦ Requires thread-safe, efficient data structures
Most GDS algorithms follow the same “compute pattern” for scalability:
1. Procedure call, e.g. CALL gds.wcc.mutate(...)
2. Input validation (meta data and algo configuration)
3. Graph Catalog read (potentially label / type filtered)
4. Compute
a. initialize algorithm specific data structures (e.g. Huge[Atomic]*Array)
b. partition node id space based on concurrency (range or degree partitioned)
c. create a task for each partition (task captures the algorithm logic)
d. run tasks in parallel (optionally repeat until the termination criterion is reached)
5. Process result according to execution mode (stream, mutate, write, stats)
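Steps 4a to 4d of the compute pattern can be sketched with the JDK executor framework. The sketch is generic: `ComputeTemplateSketch`, `rangePartition` and the placeholder task logic (squaring the node id) are assumptions for illustration, and a real algorithm would read the graph inside each task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the "compute" step: partition, create tasks, run in parallel.
final class ComputeTemplateSketch {
    // 4b. split [0, nodeCount) into at most `concurrency` ranges of {start, end}
    static List<long[]> rangePartition(long nodeCount, int concurrency) {
        long batchSize = (nodeCount + concurrency - 1) / concurrency;
        List<long[]> partitions = new ArrayList<>();
        for (long start = 0; start < nodeCount; start += batchSize) {
            partitions.add(new long[] {start, Math.min(start + batchSize, nodeCount)});
        }
        return partitions;
    }

    static long[] compute(int nodeCount, int concurrency) throws Exception {
        long[] result = new long[nodeCount];                 // 4a. algorithm-specific data structure
        List<Callable<Void>> tasks = new ArrayList<>();
        for (long[] partition : rangePartition(nodeCount, concurrency)) {
            tasks.add(() -> {                                // 4c. one task per partition
                for (long id = partition[0]; id < partition[1]; id++) {
                    result[(int) id] = id * id;              // placeholder for the algorithm logic
                }
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            for (Future<Void> future : pool.invokeAll(tasks)) { // 4d. run tasks in parallel
                future.get();
            }
        } finally {
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compute(10, 3)[9]);
    }
}
```

Each task writes only to its own id range, so a plain array is sufficient here. Algorithms whose tasks write to shared positions need atomic or otherwise thread-safe structures instead.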
Algorithm Execution - Compute template
• Step 4 “Compute” can be implemented with one of two internal APIs:
• The Graph API (internal)
◦ Requires knowledge about GDS internals
◦ Most expressive API for manual optimization
◦ Often faster, but also more complex
◦ Developer needs to take care of partitioning, parallelism, thread-safety etc.
• The Pregel* API (external)
◦ Implemented on top of Graph API
◦ Requires less knowledge about GDS internals
◦ Less expressive API tailored for iterative, message-passing-based algorithms
◦ Automatic parallelization and load balancing using Fork-Join execution
*Malewicz et al: “Pregel: A System for Large-Scale Graph Processing”, Google
Algorithm Execution - Weakly Connected Components
Wcc::compute(Graph g, Config c) -> Components {
    dss = init(g.nodeCount()) // DisjointSetStruct
    // one task per partition of the node id range
    tasks = rangePartition(g.nodeCount()).map(p -> new WccTask(g, p, dss))
    executeParallel(tasks, c.concurrency())
    return Components.of(dss)
}

class WccTask(Graph g, Partition p, DisjointSetStruct dss) {
    run() {
        // union source and target of every relationship in this partition
        p.forEachNode(node ->
            g.forEachRelationship(node, (s, t) -> dss.union(s, t))
        );
    }
}

Implementation: https://tinyurl.com/mr2ayp75
// also called Union-Find; Wcc uses a HugeAtomicDisjointSetStruct
class DisjointSetStruct {
    union(long idA, long idB);
    setIdOf(long id) -> long;
}

Input: {{0}, {42}, {3}, {6}, {54}, {1337}}
    dss.union(0, 42)
    dss.union(3, 6)
    dss.union(42, 1337)
Output: {{0, 42, 1337}, {3, 6}, {54}}
    dss.setIdOf(42)   // 0
    dss.setIdOf(1337) // 0
    dss.setIdOf(6)    // 3
    dss.setIdOf(54)   // 54
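A thread-safe disjoint set structure can be sketched with compare-and-set operations on an `AtomicLongArray`. This is an illustration only: the class `AtomicDisjointSetSketch` is hypothetical, it is lock-free rather than wait-free, and it assumes that linking the root with the larger id below the root with the smaller id is acceptable, which makes the set id the smallest id in the set, as in the example above.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical lock-free union-find sketch; limited to int-sized node counts.
final class AtomicDisjointSetSketch {
    private final AtomicLongArray parent;

    AtomicDisjointSetSketch(int nodeCount) {
        parent = new AtomicLongArray(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            parent.set(i, i); // every node starts in its own set
        }
    }

    // follows parent pointers to the root and shortens the path on the way
    long setIdOf(long id) {
        long current = id;
        while (true) {
            long p = parent.get((int) current);
            if (p == current) {
                return current;
            }
            long grandParent = parent.get((int) p);
            // best-effort path shortening; a failed CAS is harmless
            parent.compareAndSet((int) current, p, grandParent);
            current = p;
        }
    }

    void union(long idA, long idB) {
        while (true) {
            long rootA = setIdOf(idA);
            long rootB = setIdOf(idB);
            if (rootA == rootB) {
                return;
            }
            long smaller = Math.min(rootA, rootB);
            long larger = Math.max(rootA, rootB);
            // succeeds only if `larger` is still a root; otherwise retry with fresh roots
            if (parent.compareAndSet((int) larger, larger, smaller)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        AtomicDisjointSetSketch dss = new AtomicDisjointSetSketch(1338);
        dss.union(0, 42);
        dss.union(3, 6);
        dss.union(42, 1337);
        System.out.println(dss.setIdOf(1337) + " " + dss.setIdOf(6) + " " + dss.setIdOf(54));
    }
}
```

Because parent pointers only ever move to smaller ids, concurrent unions cannot create cycles, so several tasks can call `union` on the same instance without locks.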
Notes on the Wcc implementation:
◦ DisjointSetStruct: thread-safe, wait-free implementation
◦ Graph: uses thread-local decompressing cursors
Algorithm Execution - WCC - Scale Testing
Algorithm Execution - Page Rank
• PageRank is implemented using the Pregel API
◦ initialization, partitioning and task execution are handled by Pregel
◦ provides tailored queue implementation for message passing
◦ developer can focus on the algorithm logic in a vertex-centric computation

// called in parallel for each node in each superstep
PageRank::compute(Messages messages, Context c) {
    rank = c.getValue(PAGE_RANK)
    if (c.superstep() > 0) {
        // messages sent by other nodes in the previous superstep
        sum = messages.sum()
        // Page Rank specific logic
        rank = ((1 - dampingFactor) / c.nodeCount()) + dampingFactor * sum
        c.setValue(PAGE_RANK, rank)
    }
    // send new message to neighbors
    c.sendToNeighbors(rank)
}

Implementation: https://tinyurl.com/dpyhmmkw
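The superstep model can be simulated sequentially to see how the messages flow. The sketch below is not the Pregel API: `PregelPageRankSketch` is a hypothetical class, the initial rank of 1 / nodeCount is an assumption, and each node sends its rank divided by its out-degree, as in the textbook formulation of PageRank.

```java
import java.util.Arrays;

// Hypothetical sequential simulation of vertex-centric PageRank supersteps.
final class PregelPageRankSketch {
    // outNeighbors[node] lists the targets of the node's outgoing relationships
    static double[] pageRank(int[][] outNeighbors, double dampingFactor, int supersteps) {
        int nodeCount = outNeighbors.length;
        double[] rank = new double[nodeCount];
        Arrays.fill(rank, 1.0 / nodeCount);                  // assumed initial value
        double[] inbox = new double[nodeCount];              // message sums from the previous superstep
        for (int superstep = 0; superstep < supersteps; superstep++) {
            double[] outbox = new double[nodeCount];
            for (int node = 0; node < nodeCount; node++) {   // "compute" for every node
                if (superstep > 0) {
                    rank[node] = (1 - dampingFactor) / nodeCount + dampingFactor * inbox[node];
                }
                int degree = outNeighbors[node].length;
                for (int target : outNeighbors[node]) {      // send to neighbors
                    outbox[target] += rank[node] / degree;
                }
            }
            inbox = outbox;                                  // delivered in the next superstep
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] cycle = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(cycle, 0.85, 20)));
    }
}
```

Messages written in one superstep become visible only in the next one, which is what lets the per-node computations of a superstep run in parallel.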
Algorithm Execution - Page Rank - Scale Testing
Neo4j GDS under the microscope
Arrow Data Import and Export
Apache Arrow
• “Customers may want their data to be projected from sources other than Neo4j”
• “Customers often export algorithm results for downstream analysis”
• Requires parallel, structured, compressed data streaming
• Arrow is a language-independent columnar memory format for flat and hierarchical data (https://arrow.apache.org/):
◦ Data is organized for efficient analytic operations on CPU and GPU
◦ Arrow Flight is used to move Arrow data efficiently between processes and machines
◦ Written in C++ with wrappers for Java, Rust, Python, C/C#, Matlab, R, etc.
Check out Dave’s talk: “Achieve Blazing-Fast Ingest Speeds with Apache Arrow”
Graph Import via Apache Arrow
• GDS Arrow Flight Server is started / stopped with the DBMS
• Arrow Flight Client must follow protocol for importing data
• Import phases are indicated by messages
• Record batches can be sent in parallel by using multiple clients
https://neo4j.com/docs/graph-data-science/current/graph-project-apache-arrow/

Arrow Flight Client -> GDS Arrow Flight Server
1 CREATE_GRAPH “my_graph” (server: ACK)
2 Send Nodes Flight stream, then NODE_LOAD_DONE “my_graph” (server: ACK)
3 Send Relationships Flight stream, then RELATIONSHIP_LOAD_DONE “my_graph” (server: ACK, in-memory graph is created)
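The ordering of the import phases can be modelled as a small state machine. The sketch below does not use the Arrow Flight library and is not the GDS server code; `ArrowImportProtocolSketch` and all of its methods are hypothetical and only mirror the message names of the protocol above.

```java
// Hypothetical model of the import phases; no real Arrow Flight calls are made.
final class ArrowImportProtocolSketch {
    enum Phase { NOT_STARTED, LOADING_NODES, LOADING_RELATIONSHIPS, DONE }

    private Phase phase = Phase.NOT_STARTED;
    private long nodeCount;
    private long relationshipCount;

    synchronized String createGraph(String graphName) {     // CREATE_GRAPH
        expect(Phase.NOT_STARTED);
        phase = Phase.LOADING_NODES;
        return "ACK";
    }

    synchronized void putNodes(long batchSize) {            // node record batch, possibly from many clients
        expect(Phase.LOADING_NODES);
        nodeCount += batchSize;
    }

    synchronized String nodeLoadDone() {                    // NODE_LOAD_DONE
        expect(Phase.LOADING_NODES);
        phase = Phase.LOADING_RELATIONSHIPS;
        return "ACK";
    }

    synchronized void putRelationships(long batchSize) {    // relationship record batch
        expect(Phase.LOADING_RELATIONSHIPS);
        relationshipCount += batchSize;
    }

    synchronized String relationshipLoadDone() {            // RELATIONSHIP_LOAD_DONE
        expect(Phase.LOADING_RELATIONSHIPS);
        phase = Phase.DONE;
        return "ACK";
    }

    synchronized Phase phase() {
        return phase;
    }

    private void expect(Phase expected) {
        if (phase != expected) {
            throw new IllegalStateException("expected phase " + expected + " but was " + phase);
        }
    }

    public static void main(String[] args) {
        ArrowImportProtocolSketch server = new ArrowImportProtocolSketch();
        server.createGraph("my_graph");
        server.putNodes(3);
        server.nodeLoadDone();
        server.putRelationships(2);
        server.relationshipLoadDone();
        System.out.println(server.phase());
    }
}
```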
Property Export via Apache Arrow
• GET actions mirror stream node and relationship property procedures
• Output streams are produced in parallel and can be consumed in parallel
https://neo4j.com/docs/graph-data-science/current/graph-catalog-apache-arrow-ops/

1 Arrow Flight Client -> GDS Arrow Flight Server: GET Property Message
{
    graph_name: “my_graph”,
    procedure: “gds.graph.streamNodeProperty”,
    config: {
        node_labels: [“A”, “B”],
        node_property: “foobar”
    }
}
GDS Arrow Flight Server -> Arrow Flight Client: Send Node Property Flight stream (read from the in-memory graph)
Arrow Data Import - Scale testing
• Graph 500 data set scale 32 (~4 bn nodes, ~68 bn relationships)
◦ 1 node property, 3 relationship properties
◦ Data stored in Google Storage
◦ Node files: 476 parquet files 101.7 MB each
◦ Relationship files: 50,001 parquet files 41.1 MB each
• Server: GCP m1-ultramem-160 (160vCPU - 3844GB RAM)
◦ Neo4j with 3.3 TB Heap and 12 GiB PageCache
• Ingest via Beam/Dataflow Configuration
◦ 68 dataflow workers (n2-standard-2)
◦ 128 concurrent server-side graph creation threads
Arrow Data Import - Scale testing
● GDS Graph Load in ~3.5 hours
○ Nodes: 4,294,967,296 at a rate of ~6.2 M nodes/s
○ Relationships: 68,719,476,736 at a rate of ~8.0 M relationships/s
○ Overall throughput: 5.5M objects/s
○ Phases: Nodes (11.5m), Edges (2h 23m), Graph Creation (55m)
Summary and Outlook
Summary - Scalability challenges
• Data size
◦ Requires a compact in-memory graph representation
◦ Compact Node Id Mappings and Compressed CSR data structure
• Algorithm execution
◦ Requires parallel, topology-aware algorithm execution
◦ Low-level Graph API to achieve the optimal performance on a JVM
◦ Mid-level APIs to simplify the development of algorithms
• Data import / export
◦ Requires non-blocking graph construction
◦ Parallel construction of node and relationship data structures
◦ Requires parallel, structured, compressed data streaming
◦ Apache Arrow provides a fast mechanism for data import / export
Outlook
• Property compression
◦ Node and relationship properties are currently stored uncompressed
◦ Mutate leads to many properties stored in-memory
• Algorithm execution
◦ Alternative algorithm execution frameworks
◦ Off-load Machine Learning workloads to native libraries (e.g. TensorFlow)
• Graph representation
◦ Allow loading only parts of the graph into memory to enable large-scale computation with limited resources
• Integration
◦ Apache Arrow opens a whole world of systems to interact with
◦ Access to this world needs to be made simple for customers
Thank you
martin.junghanns@neo4j.com
https://github.com/s1ck
@kc1s
Special thanks to
Kent Stroker, Sr. Sales Engineer at Neo4j
for setting up and running the GDS project and algorithm scale tests and
Dave Voutila, Sales Engineering Manager at Neo4j
for setting up and running the GDS Arrow scale tests

More Related Content

What's hot

Neo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic trainingNeo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j
 
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptxGraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
jexp
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
Max De Marzi
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global Travel
Neo4j
 
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptxEncrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
Neo4j
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
Neo4j
 
Intro to Neo4j presentation
Intro to Neo4j presentationIntro to Neo4j presentation
Intro to Neo4j presentation
jexp
 
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures LibraryAPOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
jexp
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdf
Neo4j
 
RDBMS to Graph
RDBMS to GraphRDBMS to Graph
RDBMS to Graph
Neo4j
 
Training Series: Build APIs with Neo4j GraphQL Library
Training Series: Build APIs with Neo4j GraphQL LibraryTraining Series: Build APIs with Neo4j GraphQL Library
Training Series: Build APIs with Neo4j GraphQL Library
Neo4j
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
Neo4j
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
Neo4j
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Neo4j
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
Max De Marzi
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph Database
Tobias Lindaaker
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quickl
Neo4j
 
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
Neo4j
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Neo4j
 

What's hot (20)

Neo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic trainingNeo4j GraphDay Seattle- Sept19- neo4j basic training
Neo4j GraphDay Seattle- Sept19- neo4j basic training
 
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptxGraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global Travel
 
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptxEncrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
Encrypting and Protecting Your Data in Neo4j(Jeff_Tallman).pptx
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
 
Intro to Neo4j presentation
Intro to Neo4j presentationIntro to Neo4j presentation
Intro to Neo4j presentation
 
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures LibraryAPOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdf
 
RDBMS to Graph
RDBMS to GraphRDBMS to Graph
RDBMS to Graph
 
Training Series: Build APIs with Neo4j GraphQL Library
Training Series: Build APIs with Neo4j GraphQL LibraryTraining Series: Build APIs with Neo4j GraphQL Library
Training Series: Build APIs with Neo4j GraphQL Library
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
 
NoSql
NoSqlNoSql
NoSql
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph Database
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quickl
 
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
Neo4j GraphSummit London - The Path To Success With Graph Database and Data S...
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
 


Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science

  • 1. © 2022 Neo4j, Inc. All rights reserved. Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science Martin Junghanns Senior Software Engineer Neo4j - Product Engineering
  • 2. Outline
    • The Neo4j Graph Data Science (GDS) Platform
      ◦ Overview
      ◦ A generic GDS workflow
    • Scalability Challenges
    • Neo4j GDS under the microscope
      ◦ Graph Projection
      ◦ “Huge” Data Structures
      ◦ Algorithm Execution
      ◦ Arrow Data Import and Export
    • Summary and Outlook
  • 3. Before we start …
    This is not an introduction to Neo4j Graph Data Science or Data Science in general.
    See the talks of my colleagues for more information:
    Alicia: “Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science”
    Zach: “Connecting Neo4j Graph Data Science into Your Data Ecosystem and Workflows”
    Luke: “New! Neo4j AuraDS: The Fastest Way to Get Started”
    Dave: “Achieve Blazing-Fast Ingest Speeds with Apache Arrow”
  • 4. Neo4j Graph Data Science
  • 5. Overview
    • Main features
      ◦ in-memory property graph optimized for executing algorithms
      ◦ graph catalog for managing multiple named in-memory graphs
      ◦ vast collection of graph and machine learning algorithms
      ◦ accessible via Cypher procedures and GDS Python Client
    • Community and Enterprise versions
      ◦ Enterprise contains improved data structures for graph representation
      ◦ Enterprise has additional features (e.g. Apache Arrow, Clustering support, …)
      ◦ Community is entirely open source; Enterprise code is closed source
  • 6. A generic GDS workflow
    1 Graph Projection
    2 Algorithm Execution
    3 Data Export
  • 7. A generic GDS workflow
    1 Graph Projection
    2 Algorithm Execution
    3 Data Export
    Callouts (two on the slide): In GDS 2.1
  • 8. A generic GDS workflow
      CALL gds.graph.project
      CALL gds.wcc.mutate
      CALL gds.pageRank.mutate
      CALL gds.louvain.mutate
      ...
      CALL gds.graph.[stream,write]NodeProperties
      CALL gds.graph.[stream,write]RelationshipProperties
      ...
      doACTION(CREATE_GRAPH, ...)
      doPUT(NODE_STREAM)
      doPUT(RELATIONSHIP_STREAM)
      doGET(NODE_PROPERTIES)
      doGET(RELATIONSHIP_PROPERTIES)
      ...
  • 9. Scalability Challenges
  • 10. Scalability challenges - Customer facing
    • Data size
      ◦ Customers with multiple billion nodes and relationships
      ◦ Not only topology, but also node and relationship properties
      ◦ Algorithm results often need to be stored in-memory
      ◦ Requires a compact in-memory graph representation
    • Data import
      ◦ Customers want their graphs projected within minutes / hours, not days
      ◦ Requires parallel, non-blocking graph construction
      ◦ Customers may want their data to be projected from sources other than Neo4j
      ◦ Requires parallel, structured, compressed data streaming
  • 11. Scalability challenges - Customer facing
    • Algorithm execution
      ◦ Customers want their results computed as fast as possible (within minutes)
      ◦ Customer graphs vary in their topology (power-law vs uniform distributions)
      ◦ Requires parallel, topology-aware algorithm execution
    • Data export
      ◦ Customers often export algorithm results for downstream analysis
      ◦ Requires parallel, structured, compressed data streaming
  • 12. Scalability challenges - Developer facing
    • Java is a memory-managed, garbage-collected language
      ◦ Garbage Collector / Object overhead becomes non-negligible at scale
      ◦ Requires usage of primitive data types instead of objects
    • Java has no generic data structures for handling primitive data types
      ◦ List<int> is not supported in Java
      ◦ long[] is limited to ~2.1 bn entries
      ◦ Requires custom data structure implementations
  • 13. Neo4j GDS under the microscope: Huge Data Structures
  • 14. Huge Data Structures
      • “Java has no generic data structures for handling primitive data types”
        ◦ int (4 Byte) vs. Integer (16 Byte), long (8 Byte) vs. Long (24 Byte)
        ◦ List<long> vs. List<Long> vs. long[]
      • GDS uses High Performance Primitive Collections (hppc), https://github.com/carrotsearch/hppc
        ◦ non-thread-safe, primitive lists, sets and maps
      • GDS provides “huge” collections for more than 2 bn elements
        ◦ Huge[Byte,Int,Long,Float,Double]Array
        ◦ HugeSparse[Byte,Int,Long,Float,Double]Array
        ◦ HugeSparse[Byte,Int,Long,Float,Double]ArrayList
        ◦ HugeAtomic[Byte,Int,Long,Float,Double]Array
        ◦ implemented to achieve performance equivalent to primitive arrays
  • 15. Huge Data Structures
      HugeLongArray
        index: 0  1  2  3   4   5   6   7   .. n
        value: 34 55 89 144 233 377 610 987 .. 42
      HugeSparseLongArray
        index: 0  1  2  3   4 5 6 7 .. n
        value: 34 55 89 144 null    .. 42
      Legend: page (long[])
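The paged layout in the diagram can be sketched in a few lines of Java. This is a minimal illustration, not the GDS implementation: the class name `SparsePagedLongArray`, the page size of 4096 entries and the `allocatedPages` helper are assumptions made for the example. A 64-bit index is split into a page index and an index within the page, which is what lets the structure address more elements than a single `long[]`. Pages are allocated on first write, so untouched pages stay `null` as in the HugeSparseLongArray row; a dense variant would simply allocate every page up front.

```java
import java.util.Arrays;

// Hypothetical sketch of a paged, sparse long array addressed by a 64-bit index.
final class SparsePagedLongArray {
    private static final int PAGE_SHIFT = 12;               // assumed page size: 4096 longs
    private static final int PAGE_SIZE = 1 << PAGE_SHIFT;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final long[][] pages;      // a page stays null until it is written to
    private final long capacity;
    private final long defaultValue;

    SparsePagedLongArray(long capacity, long defaultValue) {
        this.capacity = capacity;
        this.defaultValue = defaultValue;
        int pageCount = (int) ((capacity + PAGE_SIZE - 1) >>> PAGE_SHIFT);
        this.pages = new long[pageCount][];
    }

    void set(long index, long value) {
        int pageIndex = pageIndex(index);
        long[] page = pages[pageIndex];
        if (page == null) {
            // Allocate lazily and pre-fill with the default value.
            page = new long[PAGE_SIZE];
            Arrays.fill(page, defaultValue);
            pages[pageIndex] = page;
        }
        page[(int) (index & PAGE_MASK)] = value;
    }

    long get(long index) {
        long[] page = pages[pageIndex(index)];
        return page == null ? defaultValue : page[(int) (index & PAGE_MASK)];
    }

    // Number of pages that have actually been allocated.
    long allocatedPages() {
        return Arrays.stream(pages).filter(p -> p != null).count();
    }

    private int pageIndex(long index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (int) (index >>> PAGE_SHIFT);
    }
}
```

With a capacity of 5 bn entries, writing a single value near the end allocates one page only; every other index still returns the default value.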
  • 16. Neo4j GDS under the microscope: Graph Projection
  • 17. The in-memory graph
      ● “Customers with multiple billion nodes and relationships”
        ○ Requires a compact in-memory graph representation
      Node projection
      Relationship projection
  • 18. The in-memory graph - Nodes
      ● Nodes := Id Mapping + Node Labels + Node Properties
      Id Mapping
        id | origId
        0  | 42
        1  | 1337
        2  | 1984
        .. |
        n  | 0
      Node Labels (one column per label)
        id | labels
        0  | 1 0 0 1
        1  | 0 1 0 0
        2  | 0 1 0 1
        .. |
        n  | 1 0 0 0
      Node Properties
        id | foo   | bar
        0  | 73301 | [13, 37]
        1  | 78717 | [98]
        2  | 78717 | [1, 3, 4]
        .. |       |
        n  | 78751 | [12, 13]
  • 20. The in-memory graph - Nodes - Id Mapping
      ● Node id space in the original data is often sparse
        ○ Neo4j: store fragmentation and/or subgraph projection
        ○ Arrow: customer data contains arbitrary 64 Bit long ids
      ● GDS uses a consecutive node id space internally: [0..nodeCount)
        ○ favors use of array-based data structures
        ○ simplifies algorithm implementation: for (long id = 0; id < nc; id++)
        ○ favors relationship compression
      ● Id Mapping has two main methods
        ○ toInternalId(long originalId) -> long
        ○ toOriginalId(long internalId) -> long
      ● Different implementations for Community and Enterprise
  • 21. The in-memory graph - Nodes - Id Mapping
      ● Community Edition - ArrayIdMap
      id -> origId: 0 -> 42, 1 -> 1337, 2 -> 1984, ...
      class ArrayIdMap {
        HugeLongArray originalIds;       // maps from internal id to original id
        HugeSparseLongArray internalIds; // maps from original id to internal id
        long toOriginalId(long internalId) {
          return originalIds.get(internalId);
        }
        long toInternalId(long originalId) {
          return internalIds.get(originalId);
        }
      }
      originalIds
        index: 0  | 1    | 2    | 3    | 4    | 5  | 6  | 7    | .. | n
        value: 42 | 1337 | 1984 | 1985 | 1986 | 43 | 41 | 1987 | .. | ..
      internalIds
        index: 0 | 1  | 2  | 3 | 4-7  | .. | 40 | 41 | 42 | 43 | .. | m
        value: 8 | -1 | -1 | 9 | null | .. | -1 | 6  | 0  | 5  | .. | 7
      That’s 2*8 = 16 Byte for a single id-to-id mapping.
  • 22. The in-memory graph - Nodes - Id Mapping
      ● Enterprise Edition - BitIdMap
      id -> origId: 0 -> 42, 1 -> 1337, 2 -> 1984, ...
      class BitIdMap {
        LongLongBitMap map; // maps between both id spaces
        long toOriginalId(long internalId) {
          return map.getKey(internalId);
        }
        long toInternalId(long originalId) {
          return map.getValue(originalId);
        }
      }
      pages (long[])
        page 0  [0,63]      | 0x40000000000
        ..
        page 20 [1280,1343] | 0x200000000000000
        ..
        page 31 [1984,2047] | 0x1
        ..
      Each page represents 64 original ids in a single long value.
      pages[0] = 00000000 00000000 00000100 00000000 00000000 00000000 00000000 00000000
      That’s 1 Bit for a single id-to-id mapping.
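One way a bitmap can act as an id mapping is the classic rank/select idea, sketched below. This is an illustrative assumption, not the Enterprise BitIdMap: the class `BitmapIdMap` and its `rankBefore` array are hypothetical, and the sketch assumes internal ids are assigned in ascending order of original id, which matches the example on the slide (42 -> 0, 1337 -> 1, 1984 -> 2). The internal id of an original id is the number of set bits in front of it; the reverse lookup finds the page that holds the k-th set bit and then the bit inside that page.

```java
// Hypothetical rank/select sketch of a bitmap-backed id mapping.
final class BitmapIdMap {
    private final long[] pages;       // one bit per original id, 64 ids per page
    private final long[] rankBefore;  // number of set bits in all pages before page i
    private final long nodeCount;

    BitmapIdMap(long[] originalIds, long maxOriginalId) {
        int pageCount = (int) (maxOriginalId >>> 6) + 1;
        pages = new long[pageCount];
        for (long id : originalIds) {
            pages[(int) (id >>> 6)] |= 1L << (id & 63);
        }
        rankBefore = new long[pageCount];
        long count = 0;
        for (int i = 0; i < pageCount; i++) {
            rankBefore[i] = count;
            count += Long.bitCount(pages[i]);
        }
        nodeCount = count;
    }

    // Returns the internal id, or -1 if the original id is not part of the graph.
    long toInternalId(long originalId) {
        int page = (int) (originalId >>> 6);
        long bit = 1L << (originalId & 63);
        if ((pages[page] & bit) == 0) {
            return -1;
        }
        // Rank: set bits in earlier pages plus set bits below this bit in the same page.
        return rankBefore[page] + Long.bitCount(pages[page] & (bit - 1));
    }

    long toOriginalId(long internalId) {
        if (internalId < 0 || internalId >= nodeCount) {
            throw new IndexOutOfBoundsException("internal id " + internalId);
        }
        // Binary search for the last page whose rankBefore is <= internalId.
        int lo = 0;
        int hi = pages.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (rankBefore[mid] <= internalId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        // Select: drop the lowest set bits until the wanted one is the lowest.
        long word = pages[lo];
        for (long skip = internalId - rankBefore[lo]; skip > 0; skip--) {
            word &= word - 1;
        }
        return ((long) lo << 6) + Long.numberOfTrailingZeros(word);
    }
}
```

Storing one counter per 64-bit page doubles the memory of the bitmap; a production structure would keep counters per larger block to stay close to one bit per id.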
  • 23. The in-memory graph - Nodes - Id Mapping
      node count | max original node id | ArrayIdMap | BitIdMap | Reduction per Mapping
      1 bn       | 100 bn               | [15 GiB ... 752 GiB], [~16 … 807] Byte / Mapping | 12 GiB, ~12 Byte / Mapping | ~ 32 x
      100 bn     | 100 bn               | 1490 GiB, ~16 Byte / Mapping | 12 GiB, ~1.03 Bit / Mapping | ~ 128 x
      Community Edition
        ArrayIdMap: [15356 MiB ... 752 GiB]
        |-- this.instance: 40 Bytes
        |-- original identifiers: 7630 MiB
        |-- internal identifiers: [7726 MiB ... 745 GiB]
        |-- Node Label BitSets: 0 Bytes
      Enterprise Edition
        BitIdMap: 12386 MiB
        |-- this.instance: 32 Bytes
        |-- Identifier mapping: 12386 MiB
            |-- this.instance: 48 Bytes
            |-- pages: 11920 MiB
            |-- block offsets: 186 MiB
            |-- sorted block offsets: 186 MiB
            |-- block mapping: 93 MiB
        |-- Node Label BitSets: 0 Bytes
  • 24. The in-memory graph - Nodes - Node Labels
      • BitSet per label
        ◦ constant-time lookup and efficient for label filters (union)
        ◦ memory consumption depends on node count: ~ nodeCount Bit
        ◦ 1 bn nodes ~ 120 MiB
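The per-label BitSet idea can be tried out with `java.util.BitSet`. The sketch below is only an illustration: the class `NodeLabelIndex` and the label names are made up for the example, and `java.util.BitSet` is indexed by `int`, so it cannot hold billions of nodes the way a huge bit set would. It shows the two operations named above: a constant-time label check and a label filter computed as the union (bitwise OR) of the per-label bit sets.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one BitSet per label, bit i is set if node i carries the label.
final class NodeLabelIndex {
    private final Map<String, BitSet> bitSets = new HashMap<>();
    private final int nodeCount;

    NodeLabelIndex(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    void addLabel(int nodeId, String label) {
        bitSets.computeIfAbsent(label, l -> new BitSet(nodeCount)).set(nodeId);
    }

    // Constant-time check of a single bit.
    boolean hasLabel(int nodeId, String label) {
        BitSet bitSet = bitSets.get(label);
        return bitSet != null && bitSet.get(nodeId);
    }

    // Label filter: all nodes that carry at least one of the given labels.
    BitSet union(String... labels) {
        BitSet result = new BitSet(nodeCount);
        for (String label : labels) {
            BitSet bitSet = bitSets.get(label);
            if (bitSet != null) {
                result.or(bitSet);
            }
        }
        return result;
    }
}
```

The memory figure follows from one bit per node and label: 10^9 bits are 125,000,000 bytes, which is about 119 MiB.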
  • 25. The in-memory graph - Nodes - Node Properties
      • HugeSparse*Array per property
        ◦ constant-time lookup
        ◦ memory consumption depends on node count: ~ nodeCount * sizeOf(type)
        ◦ 1 bn long properties ~ 7634 MiB
  • 26. The in-memory graph - Relationships
      ● Relationships := AdjacencyList + Properties
      source | target 0 | target 1 | target 2 | target 3
      1337   | 2001     | 1984     | 2005     |
      foo    | 2.1      | 1.5      | 4.3      |
      2003   | 3456     | 3442     | 1992     | 2005
      foo    | 2.5      | 24.6     | 4.2      | 5.7

      source | target 0 | target 1 | target 2
      1234   | 3323     | 3323     | 4346
      bar    | 4.2      | 2.1      | 8.4
  • 27. The in-memory graph - Relationships
      ● Accessing the adjacency of a node is the most important operation
        ○ adjacency must be accessible in constant time
        ○ adjacency must be stored compactly in memory for cache efficiency
        ○ adjacency must be optimized for graph traversal
      ● Real-world graphs are sparse
        ○ each node is only connected to a subset of other nodes
        ○ relationship distributions vary depending on the use case
      ● Adjacency List (AL) has two main methods (the main API for algorithm implementers)
        ○ AL::forEachRelationship(long nodeId, (src, tgt) -> {})
        ○ AL::forEachRelationship(long nodeId, (src, tgt, prop) -> {})
      ● Same implementation for Community and Enterprise
  • 28. The in-memory graph - Relationships - CSR
      • Adjacency List implementation: Compressed Sparse Row Variant (https://en.wikipedia.org/wiki/Sparse_matrix)
      offsets (HugeLongArray)
        index: 0 ... 1337 ... 2003 ... n
        value: 0 ... 4096 ... 5432 ... ...
      targets (long[][])
        index: 0 ... 4096 - 4098     ... 5432 - 5435
        value: 0 ... 1984 2001 2005 ... 1992 2005 3442 3456 ...
      degrees (HugeIntArray)
        index: 0 ... 1337 ... 2003 ... n
        value: 0 ... 3    ... 4    ... ...
      Storing targets uncompressed consumes relationship count * 8 Byte: 100 bn rels ~ 745 GiB
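The three arrays of the uncompressed layout can be built from a relationship list with a counting pass and a placement pass. The following is a small-scale sketch under stated assumptions: the class `CsrAdjacencyList`, its constructor and the `RelationshipConsumer` interface are hypothetical, and plain Java arrays stand in for the huge, paged arrays named on the slide. Only the shape of `forEachRelationship(nodeId, (src, tgt) -> {})` is taken from the deck.

```java
import java.util.Arrays;

// Hypothetical small-scale CSR adjacency list: offsets + degrees + sorted targets.
final class CsrAdjacencyList {
    interface RelationshipConsumer {
        void accept(long source, long target);
    }

    private final long[] offsets;  // start of each node's target range
    private final int[] degrees;   // number of targets per node
    private final long[] targets;  // all target ids, grouped by source node

    // relationships[i] = {source, target}, node ids in [0, nodeCount)
    CsrAdjacencyList(int nodeCount, long[][] relationships) {
        degrees = new int[nodeCount];
        for (long[] rel : relationships) {
            degrees[(int) rel[0]]++;
        }
        offsets = new long[nodeCount];
        long offset = 0;
        for (int node = 0; node < nodeCount; node++) {
            offsets[node] = offset;
            offset += degrees[node];
        }
        targets = new long[relationships.length];
        int[] written = new int[nodeCount];
        for (long[] rel : relationships) {
            int source = (int) rel[0];
            targets[(int) offsets[source] + written[source]++] = rel[1];
        }
        // Keep each node's targets sorted, as in the slide's example.
        for (int node = 0; node < nodeCount; node++) {
            int from = (int) offsets[node];
            Arrays.sort(targets, from, from + degrees[node]);
        }
    }

    int degree(long nodeId) {
        return degrees[(int) nodeId];
    }

    void forEachRelationship(long nodeId, RelationshipConsumer consumer) {
        int from = (int) offsets[(int) nodeId];
        int to = from + degrees[(int) nodeId];
        for (int i = from; i < to; i++) {
            consumer.accept(nodeId, targets[i]);
        }
    }
}
```

Looking up a node's adjacency costs one read of `offsets` and one of `degrees`, after which the targets are a contiguous range, which is what makes traversal cache friendly.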
  • 29. The in-memory graph - Relationships - CSR
      • Adjacency List implementation: Compressed Sparse Row Variant
        ◦ Target compression is applied during graph projection
      step                     | source node 1337              | source node 2003
      initial target lists     | 2001 1984 2005 (24 Byte)      | 3456 3442 1992 2005 (32 Byte)
      sorting                  | 1984 2001 2005                | 1992 2005 3442 3456
      delta encoding           | 1984 17 4                     | 1992 13 1437 14
      variable-length encoding | 0x40 0x8F 0x91 0x84 (4 Byte)  | 0x48 0x8F 0x8D 0x1D 0x8B 0x8E (6 Byte)
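The three compression steps (sorting, delta encoding, variable-length encoding) can be reproduced with a short Java sketch. The byte layout is an assumption inferred from the byte values shown for source node 1337: seven payload bits per byte, least significant group first, with the high bit set on the final byte of each value. The class and method names are hypothetical and this is not presented as the GDS encoder.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of sort + delta + variable-length encoding of a target list.
final class TargetCompression {

    // Sorts the targets and replaces each one by its distance to the predecessor.
    static long[] sortAndDelta(long[] targets) {
        long[] sorted = targets.clone();
        Arrays.sort(sorted);
        long[] deltas = new long[sorted.length];
        long previous = 0;
        for (int i = 0; i < sorted.length; i++) {
            deltas[i] = sorted[i] - previous;
            previous = sorted[i];
        }
        return deltas;
    }

    // Assumed layout: 7 payload bits per byte, low group first, high bit marks the last byte.
    static byte[] encodeVarLong(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long value : values) {
            long v = value;
            while ((v & ~0x7FL) != 0) {
                out.write((int) (v & 0x7F));
                v >>>= 7;
            }
            out.write((int) (v | 0x80));
        }
        return out.toByteArray();
    }

    static byte[] compress(long[] targets) {
        return encodeVarLong(sortAndDelta(targets));
    }

    // Reverses both steps and returns the sorted target ids.
    static long[] decompress(byte[] bytes, int degree) {
        long[] targets = new long[degree];
        long value = 0;
        long previous = 0;
        int shift = 0;
        int next = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                previous += value;          // undo the delta encoding
                targets[next++] = previous;
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return targets;
    }
}
```

For the targets 2001, 1984, 2005 of source node 1337, `compress` yields the four bytes 0x40 0x8F 0x91 0x84, compared with 24 bytes for three uncompressed long values. Small deltas are what make this pay off, which is why a consecutive internal id space favors relationship compression.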
  • 30. The in-memory graph - Relationships - CSR
      • Adjacency List implementation: Compressed Sparse Row Variant
      offsets (HugeLongArray)
        index: 0 ... 1337 ... 2003 ... n
        value: 0 ... 2100 ... 3205 ... ...
      targets (byte[][])
        index: 0 .. 2100 - 2103         .. 3205 - 3210 ...
        value: 0 .. 0x40 0x8F 0x91 0x84 .. 0x48 0x8F 0x8D 0x1D 0x8B 0x8E ...
      degrees (HugeIntArray)
        index: 0 ... 1337 ... 2003 ... n
        value: 0 ... 3    ... 4    ... ...
  • 31. The in-memory graph - Relationships - Properties
      • Relationship properties are also stored in CSR representation
        ◦ in the exact same order as the corresponding target node ids
        ◦ allows lock-step iteration of target ids and relationship values
      properties (long[][])
        index: 0 .. 6132 - 6134 .. 8192 - 8195 ..
        value: 0 .. 1.5 2.1 4.3 .. 4.2 5.7 24.6 2.5 ..
      targets (byte[][])
        index:   0 .. 2100 - 2103         .. 3205 - 3210 ..
        value:   0 .. 0x40 0x8F 0x91 0x84 .. 0x48 0x8F 0x8D 0x1D 0x8B 0x8E ..
        decoded:      1984 2001 2005         1992 2005 3442 3456
  • 32. The in-memory graph - Relationships - Memory
      node count | relationship count | uncompressed values          | compressed values
      1 bn       | 100 bn             | ~ 756 GiB, ~8.2 Byte / Value | [106 GiB ... 382 GiB], ~1.4 Byte / Value
      CompressedAdjacencyList (default for topology): [106 GiB ... 382 GiB]
      |-- this.instance: 24 Bytes
      |-- pages: [95 GiB ... 371 GiB]
      |-- degrees: 3815 MiB
      |-- offsets: 7630 MiB
      UncompressedAdjacencyList (used for rel properties): 756 GiB
      |-- this.instance: 24 Bytes
      |-- pages: 745 GiB
      |-- degrees: 3815 MiB
      |-- offsets: 7630 MiB
  • 33. The in-memory graph - Native Graph Projection
      ● Nodes and relationships are read via sequential scans of record stores
        ○ scan is IO-friendly since we touch a page only once (no random reads)
        ○ allows us to use a small page cache (~ page size * 100 * concurrency)
        ○ scales nearly linearly with the number of threads
      ● Nodes are read from the node record store
        ○ can also make use of node label index and property index if part of projection
        ○ threads read chunks of node records and forward them to thread-safe Id Mapping-, Label- and Property-Builders
      ● Relationships are read from the relationship record store
        ○ threads read chunks of relationship records, group them by source or target, pre-compress and store partial adjacency lists in an intermediate list
        ○ multi-threaded compression of partially compressed adjacency lists into CSR
  • 34. Native Graph Projection - Scale Testing
      • Imported Graph500* artificial data sets into Neo4j
        ◦ power-law relationship distribution, no properties
      • Azure M416ms, 416 vCPU, 11 TB RAM
        ◦ Neo4j with ~9 TB Heap and ~1 TB PageCache
        ◦ gds.graph.project with readConcurrency: 416
      scale | node count (2^scale) | relationship count (2^(scale + 4)) | Neo4j disk size [GB]
      30    | 1,073,741,824        | 17,179,869,184                     | 1113
      31    | 2,147,483,648        | 34,359,738,368                     | 2226
      32    | 4,294,967,296        | 68,719,476,736                     | 4452
      33    | 8,589,934,592        | 137,438,953,472                    | 9272
      *Graph 500 Benchmark specification https://graph500.org/
  • 35. Native Graph Projection - Scale Testing
      Known problem. GDS 2.2 (fall ‘22) will improve projection performance for large scale graphs.
  • 36. Native Graph Projection - Scale Testing
  • 37. Neo4j GDS under the microscope: Algorithm Execution
  • 38. Algorithm Execution - Compute template
      • “Customers want their results computed as fast as possible.”
        ◦ Requires parallel, topology-aware algorithm execution
        ◦ Requires thread-safe, efficient data structures
      Most GDS algorithms follow the same “compute pattern” for scalability:
      1. Procedure call, e.g. CALL gds.wcc.mutate(...)
      2. Input validation (metadata and algorithm configuration)
      3. Graph Catalog read (potentially label / type filtered)
      4. Compute
         a. initialize algorithm-specific data structures (e.g. Huge[Atomic]*Array)
         b. partition node id space based on concurrency (range or degree partitioned)
         c. create a task for each partition (task captures the algorithm logic)
         d. run tasks in parallel (optionally repeat until a termination criterion is reached)
      5. Process result according to execution mode (stream, mutate, write, stats)
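Steps 4b to 4d of the compute pattern can be sketched with the JDK's executor framework. The names `ComputeTemplate`, `Partition`, `rangePartition` and `executeParallel` below are illustrative stand-ins that echo the deck's pseudocode; they are assumptions for this example, not the GDS internals. The sketch covers range partitioning only: the node id space [0..nodeCount) is cut into equally sized ranges, one task is created per range, and all tasks run on a fixed-size thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongConsumer;

// Hypothetical sketch of "partition the node id space, one task per partition, run in parallel".
final class ComputeTemplate {

    // A consecutive range of internal node ids.
    static final class Partition {
        final long startNode;
        final long nodeCount;

        Partition(long startNode, long nodeCount) {
            this.startNode = startNode;
            this.nodeCount = nodeCount;
        }

        void forEachNode(LongConsumer consumer) {
            for (long node = startNode; node < startNode + nodeCount; node++) {
                consumer.accept(node);
            }
        }
    }

    // Step 4b: split [0, nodeCount) into at most `concurrency` equally sized ranges.
    static List<Partition> rangePartition(long nodeCount, int concurrency) {
        long batchSize = Math.max(1, (nodeCount + concurrency - 1) / concurrency);
        List<Partition> partitions = new ArrayList<>();
        for (long start = 0; start < nodeCount; start += batchSize) {
            partitions.add(new Partition(start, Math.min(batchSize, nodeCount - start)));
        }
        return partitions;
    }

    // Step 4d: run all tasks on a fixed-size pool and wait for completion.
    static void executeParallel(List<? extends Runnable> tasks, int concurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : tasks) {
                futures.add(pool.submit(task));
            }
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Range partitioning gives every task the same number of nodes, which balances well on uniform graphs. On power-law graphs a degree-based partitioning, the other option named in step 4b, would balance the number of relationships per task instead.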
  • 39. Algorithm Execution - Compute template
      • Step 4 “Compute” can be implemented with one of two APIs:
      • The Graph API (internal)
        ◦ Requires knowledge about GDS internals
        ◦ Most expressive API for manual optimization
        ◦ Often faster, but also more complex
        ◦ Developer needs to take care of partitioning, parallelism, thread-safety etc.
      • The Pregel* API (external)
        ◦ Implemented on top of Graph API
        ◦ Requires less knowledge about GDS internals
        ◦ Less expressive API tailored for iterative, message-passing-based algorithms
        ◦ Automatic parallelization and load balancing using Fork-Join execution
      *Malewicz et al: “Pregel: A System for Large-Scale Graph Processing”, Google
  • 40. Algorithm Execution - Weakly Connected Components
  • 41. Algorithm Execution - Weakly Connected Components
      Wcc::compute(Graph g, Config c) -> Components {
        dss = init(g.nodeCount()) // DisjointSetStruct
        tasks = rangePartition(g.nodeCount()).map(p -> new WccTask(g, p, dss))
        executeParallel(tasks, c.concurrency())
        return Components.of(dss)
      }
      class WccTask(Graph g, Partition p, DisjointSetStruct dss) {
        run() {
          p.forEachNode(node ->
            g.forEachRelationship(node, (s, t) -> dss.union(s, t))
          );
        }
      }
      Implementation: https://tinyurl.com/mr2ayp75
  • 42. Algorithm Execution - Weakly Connected Components
      Same code as on the previous slide; dss = init(g.nodeCount()) creates a HugeAtomicDisjointSetStruct.
      // also called Union-Find
      class DisjointSetStruct {
        union(long idA, long idB);
        setIdOf(long id) -> long;
      }
      Input: {{0}, {42}, {3}, {6}, {54}, {1337}}
        dss.union(0, 42)
        dss.union(3, 6)
        dss.union(42, 1337)
      Output: {{0, 42, 1337}, {3, 6}, {54}}
        dss.setIdOf(42)   // 0
        dss.setIdOf(1337) // 0
        dss.setIdOf(6)    // 3
        dss.setIdOf(54)   // 54
  • 43. Algorithm Execution - Weakly Connected Components
      Same code as on the previous slides, with two annotations:
      • DisjointSetStruct: thread-safe, wait-free implementation
      • Graph uses thread-local decompressing cursors
      Implementation: https://tinyurl.com/mr2ayp75
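A thread-safe disjoint set can be sketched on top of `java.util.concurrent.atomic.AtomicLongArray`. The class `AtomicDisjointSet` below is a hypothetical, lock-free illustration based on compare-and-set; it is not the HugeAtomicDisjointSetStruct from GDS, it makes no wait-freedom claim, and it is limited to `int`-sized id ranges. Roots are always linked towards the smaller id, so the set id of a component is its smallest member, which reproduces the example values of the DisjointSetStruct slide.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical lock-free union-find: parent pointers updated with compare-and-set.
final class AtomicDisjointSet {
    private final AtomicLongArray parent;

    AtomicDisjointSet(int nodeCount) {
        parent = new AtomicLongArray(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            parent.set(i, i); // every node starts as its own set
        }
    }

    // Returns the root (set id) of the given node, shortening the path on the way.
    long setIdOf(long id) {
        long current = id;
        while (true) {
            long p = parent.get((int) current);
            if (p == current) {
                return current;
            }
            long grandParent = parent.get((int) p);
            // Path halving; a failed CAS only means another thread already changed the pointer.
            parent.compareAndSet((int) current, p, grandParent);
            current = grandParent;
        }
    }

    // Merges the sets of both nodes; safe to call from multiple threads.
    void union(long idA, long idB) {
        while (true) {
            long rootA = setIdOf(idA);
            long rootB = setIdOf(idB);
            if (rootA == rootB) {
                return;
            }
            long larger = Math.max(rootA, rootB);
            long smaller = Math.min(rootA, rootB);
            // Link the larger root below the smaller one; retry if the root changed meanwhile.
            if (parent.compareAndSet((int) larger, larger, smaller)) {
                return;
            }
        }
    }
}
```

Because parent pointers only ever move to smaller ids, no cycle can form even when several threads call `union` at the same time, so each WCC task can union the endpoints of its relationships without locks.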
  • 44. Algorithm Execution - WCC - Scale Testing
  • 45. Algorithm Execution - Page Rank
  • 46. Algorithm Execution - Page Rank
      • PageRank is implemented using the Pregel API
        ◦ initialization, partitioning and task execution are handled by Pregel
        ◦ provides tailored queue implementation for message passing
        ◦ developer can focus on the algorithm logic in a vertex-centric computation
      // Called in parallel for each node in each superstep
      PageRank::compute(Messages messages, Context c) {
        rank = c.getValue(PAGE_RANK)
        if (c.superstep() > 0) {
          // Messages sent by other nodes in the previous superstep
          sum = messages.sum()
          // Page Rank specific logic
          rank = ((1 - dampingFactor) / c.nodeCount()) + dampingFactor * sum
          c.setValue(PAGE_RANK, rank)
        }
        // Send new message to neighbors
        c.sendToNeighbors(rank)
      }
      Implementation: https://tinyurl.com/dpyhmmkw
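The superstep structure of the vertex-centric computation can be simulated sequentially in plain Java. The sketch below is not the Pregel API of GDS: `PregelPageRankSketch` and its `pageRank` method are hypothetical, messages are pre-summed in an inbox array instead of a message queue, and nodes are processed one after another rather than in parallel. One further assumption is labeled in the code: each node sends `rank / outDegree` to its neighbors, the usual PageRank normalization, whereas the slide's pseudocode sends `rank`.

```java
import java.util.Arrays;

// Hypothetical sequential simulation of a Pregel-style PageRank.
final class PregelPageRankSketch {

    // outNeighbors[node] lists the targets of the node's outgoing relationships.
    static double[] pageRank(int[][] outNeighbors, double dampingFactor, int supersteps) {
        int nodeCount = outNeighbors.length;
        double[] rank = new double[nodeCount];
        Arrays.fill(rank, 1.0 / nodeCount);

        // inbox[node] holds the sum of messages sent to the node in the previous superstep.
        double[] inbox = new double[nodeCount];

        for (int superstep = 0; superstep < supersteps; superstep++) {
            double[] outbox = new double[nodeCount];
            for (int node = 0; node < nodeCount; node++) {
                // The body of this loop corresponds to one compute() call.
                if (superstep > 0) {
                    rank[node] = (1 - dampingFactor) / nodeCount + dampingFactor * inbox[node];
                }
                int degree = outNeighbors[node].length;
                for (int target : outNeighbors[node]) {
                    // Assumption: the message is the rank divided by the out-degree.
                    outbox[target] += rank[node] / degree;
                }
            }
            // Messages become visible only in the next superstep.
            inbox = outbox;
        }
        return rank;
    }
}
```

Keeping a separate outbox per superstep is what gives the bulk-synchronous behavior: a node never sees a message that was sent in the same superstep.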
  • 47. Algorithm Execution - Page Rank - Scale Testing
  • 48. Neo4j GDS under the microscope: Arrow Data Import and Export
  • 49. Apache Arrow
      • “Customers may want their data to be projected from sources other than Neo4j”
      • “Customers often export algorithm results for downstream analysis”
      • Requires parallel, structured, compressed data streaming
      • Arrow is a language-independent columnar memory format for flat and hierarchical data (https://arrow.apache.org/):
        ◦ Data is organized for efficient analytic operations on CPU and GPU
        ◦ Arrow Flight is used to move Arrow data efficiently between processes and machines
        ◦ Written in C++ with wrappers for Java, Rust, Python, C/C#, Matlab, R, etc.
      Check out Dave’s talk: “Achieve Blazing-Fast Ingest Speeds with Apache Arrow”
  • 50. Graph Import via Apache Arrow
      • GDS Arrow Flight Server is started / stopped with the DBMS
      • Arrow Flight Client must follow protocol for importing data
      • Import phases are indicated by messages
      • Record batches can be sent in parallel by using multiple clients
      Arrow Flight Client -> GDS Arrow Flight Server:
        1. CREATE_GRAPH “my_graph” -> ACK
        2. Send Nodes (Flight stream), then NODE_LOAD_DONE “my_graph” -> ACK
        3. Send Relationships (Flight stream), then RELATIONSHIP_LOAD_DONE “my_graph” -> ACK
      Result: In-memory graph
      https://neo4j.com/docs/graph-data-science/current/graph-project-apache-arrow/
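The phase ordering of the import protocol can be modeled as a small state machine. The sketch below does not use the Arrow Flight library and does not reproduce the GDS server: the class `GraphImportSession`, its `Phase` enum and the `putNodes` / `putRelationships` methods are hypothetical, and record batches are reduced to a row count. Only the three message names and the ACK replies are taken from the slide.

```java
// Hypothetical model of the import phases; not the GDS Arrow Flight server.
final class GraphImportSession {
    enum Phase { AWAITING_CREATE, LOADING_NODES, LOADING_RELATIONSHIPS, DONE }

    private Phase phase = Phase.AWAITING_CREATE;
    private long nodeCount;
    private long relationshipCount;

    // Handles a control message and acknowledges it; messages must arrive in protocol order.
    synchronized String action(String message) {
        switch (message) {
            case "CREATE_GRAPH":
                transition(Phase.AWAITING_CREATE, Phase.LOADING_NODES);
                break;
            case "NODE_LOAD_DONE":
                transition(Phase.LOADING_NODES, Phase.LOADING_RELATIONSHIPS);
                break;
            case "RELATIONSHIP_LOAD_DONE":
                transition(Phase.LOADING_RELATIONSHIPS, Phase.DONE);
                break;
            default:
                throw new IllegalArgumentException("Unknown message: " + message);
        }
        return "ACK";
    }

    // Accepts a node batch; synchronized so several clients can send batches in parallel.
    synchronized void putNodes(long batchSize) {
        require(Phase.LOADING_NODES);
        nodeCount += batchSize;
    }

    synchronized void putRelationships(long batchSize) {
        require(Phase.LOADING_RELATIONSHIPS);
        relationshipCount += batchSize;
    }

    synchronized Phase phase() { return phase; }
    synchronized long nodeCount() { return nodeCount; }
    synchronized long relationshipCount() { return relationshipCount; }

    private void transition(Phase expected, Phase next) {
        require(expected);
        phase = next;
    }

    private void require(Phase expected) {
        if (phase != expected) {
            throw new IllegalStateException("Expected phase " + expected + " but was " + phase);
        }
    }
}
```

Separating the node phase from the relationship phase means the id mapping is complete before any relationship batch needs to translate its source and target ids.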
  • 51. Property Export via Apache Arrow
      • GET actions mirror stream node and relationship property procedures
      • Output streams are produced in parallel and can be consumed in parallel
      Arrow Flight Client -> GDS Arrow Flight Server:
        1. GET Property Message
           {
             graph_name: “my_graph”,
             procedure: “gds.graph.streamNodeProperty”,
             config: {
               node_labels: [“A”, “B”],
               node_property: “foobar”
             }
           }
      GDS Arrow Flight Server -> Arrow Flight Client: Send Node Property (Flight stream) from the in-memory graph
      https://neo4j.com/docs/graph-data-science/current/graph-catalog-apache-arrow-ops/
  • 52. Arrow Data Import - Scale testing
      • Graph 500 data set scale 32 (~4 bn nodes, ~68 bn relationships)
        ◦ 1 node property, 3 relationship properties
        ◦ Data stored in Google Storage
        ◦ Node files: 476 parquet files, 101.7 MB each
        ◦ Relationship files: 50,001 parquet files, 41.1 MB each
      • Server: GCP m1-ultramem-160 (160 vCPU - 3844 GB RAM)
        ◦ Neo4j with 3.3 TB Heap and 12 GiB PageCache
      • Ingest via Beam/Dataflow Configuration
        ◦ 68 dataflow workers (n2-standard-2)
        ◦ 128 concurrent server-side graph creation threads
  • 53. Arrow Data Import - Scale testing
      ● GDS Graph Load in ~3.5 hours
        ○ Nodes: 4,294,967,296 at a rate of ~6.2 M nodes/s
        ○ Relationships: 68,719,476,736 at a rate of ~8.0 M relationships/s
        ○ Overall throughput: 5.5 M objects/s
      Phases: Nodes (11.5m), Edges (2h 23m), Graph Creation (55m)
  • 54. Summary and Outlook
  • 55. Summary - Scalability challenges
      • Data size
        ◦ Requires a compact in-memory graph representation
        ◦ Compact Node Id Mappings and Compressed CSR data structure
      • Algorithm execution
        ◦ Requires parallel, topology-aware algorithm execution
        ◦ Low-level Graph API to achieve the optimal performance on a JVM
        ◦ Mid-level APIs to simplify the development of algorithms
      • Data import / export
        ◦ Requires non-blocking graph construction
        ◦ Parallel construction of node and relationship data structures
        ◦ Requires parallel, structured, compressed data streaming
        ◦ Apache Arrow provides a fast mechanism for data import / export
  • 56. Outlook
      • Property compression
        ◦ Node and relationship properties are currently stored uncompressed
        ◦ Mutate leads to many properties stored in-memory
      • Algorithm execution
        ◦ Alternative algorithm execution frameworks
        ◦ Off-load Machine Learning workloads to native libraries (e.g. TensorFlow)
      • Graph representation
        ◦ Allow loading only parts of the graph into memory to enable large-scale computation with limited resources
      • Integration
        ◦ Apache Arrow opens a whole world of systems to interact with
        ◦ Access to this world needs to be made simple for customers
  • 57. Thank you
      martin.junghanns@neo4j.com
      https://github.com/s1ck
      @kc1s
      Special thanks to Kent Stroker, Sr. Sales Engineer at Neo4j, for setting up and running the GDS project and algorithm scale tests, and Dave Voutila, Sales Engineering Manager at Neo4j, for setting up and running the GDS Arrow scale tests.