© 2022 Neo4j, Inc. All rights reserved.
🏹 Blazing Fast Ingestion with
Apache Arrow
Dave Voutila
Sales Engineering Manager, US East
👋 Hello! • Sales Engineering Manager @ Neo4j
◦
• Based in Vermont, USA 🍁
◦ Usually chasing our 2 Golden
Retrievers 🐕🐕
• Online
◦ 👨‍💻
◦ 👨‍💼
• Outside Neo4j
◦ 🐡 (hypervisor
hacker)
◦ 🐤 on “the Twitters”
📝 Agenda
• Dispelling Myths
• Defining the Problem Statements
• Reviewing Success Stories
• Showing where Arrow fits in the Solution
Ok, so what then? 🤔
Or: What problems does Apache Arrow solve? 🕵️
Who’s this all for?
• We talk about Developers & Data Scientists…
From https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b
Who’s this all for?
• But what about the Data & ML Engineers?
From https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b
Problem 1: Reading Graph Features at scale is “hard”
1. 👷 Build a modestly sized graph (millions of nodes & relationships)
2. 🎆 Generate graph features
◦ Scalars
• Community IDs – fraud rings?
• Centrality Scores – influence scores?
◦ Vectors
• Node embeddings – input for ML
◦ Tuples
• Similarity tuples (node1, node2, relationship type, score) – updates to a knowledge graph
• K-Shortest paths
3. ✈️ 🏖️ Book yourself a nice holiday while you wait for your Cypher query to
retrieve the billions of rows of data
How Users Measure Time
• 🚀 Can I use it in a REPL?
• 🍵 Should I make a coffee or tea?
• 🍕 Should I go to lunch now?
• 😴 Should I go home for the night?
• ✈️ Should I go home for the weekend?
“Real-Time”
😃
Batch
😐
Gonna Start Looking at
Jobs on LinkedIn
😭😭😭
Where are you Today?
“Real-Time”
😃
Batch
😐
Gonna Start Looking at
Jobs on LinkedIn
😭😭😭
• 🚀 Can I use it in a REPL?
• 🍵 Should I make a coffee or tea?
• 🍕 Should I go to lunch now?
• 😴 Should I go home for the night?
• ✈️ Should I go home for the weekend?
Problem 1: Reading Graph Features at scale is hard
Neo4j Python GDS Client w/ Arrow
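To make "reading graph features" concrete, here is a minimal sketch of streaming node properties out of an in-memory projection with the Neo4j GDS Python client. It assumes the `graphdatascience` package and a GDS server with Arrow enabled; the connection URI, credentials, graph name, and property name in `main()` are hypothetical placeholders, and the method names reflect the client as commonly documented, so check them against the version you have installed.

```python
def stream_node_features(gds, graph_name, properties):
    """Stream node properties from a named GDS projection.

    `gds` is expected to be a graphdatascience.GraphDataScience client.
    When the client was created with arrow=True and the server has Arrow
    enabled, the client can fetch the result over Arrow Flight instead of Bolt.
    """
    G = gds.graph.get(graph_name)  # look up an existing projection by name
    return gds.graph.nodeProperties.stream(G, properties)


def main():
    # Hypothetical connection details: replace with your own before calling.
    from graphdatascience import GraphDataScience

    gds = GraphDataScience(
        "bolt://localhost:7687", auth=("neo4j", "password"), arrow=True
    )
    df = stream_node_features(gds, "my-graph", ["embedding"])
    print(df.head())
```

Call `main()` only against a live Neo4j instance. The helper takes the client as an argument so the same function works in a notebook or a pipeline worker.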
Problem 2: Writing a Dataset to a Graph at scale is hard
• Neo4j Data Importer exists for a very specific reason!
◦ But how do you scale it?
◦ How do you move beyond CSV-in-the-Browser-via-Neo4j-Desktop?
• 🧑‍🔬 As Data Scientists, we…
◦ Have no file system access (sorry, neo4j-admin import and Aura!)
◦ Have large datasets…too large for drag & drop CSV
◦ Need to iterate while we experiment
• 👷 As Data Engineers, we…
◦ Need to integrate with a variety of “Data Lake” and Cloud sources
◦ Probably aren’t Neo4j & Cypher experts (yet).
Problem 2: Writing a Dataset to a Graph at scale is hard
• CSVs via neo4j-admin import
◦ Graph500 Scale31: 1B Nodes, 17B Relationships
◦ No properties
◦ Total time: 8h 15m
◦ Rate: ~0.6M objects/s
• BigQuery ➡️ PyArrow ➡️ Neo4j DB (via GDS Flight Server)
◦ Social Media Event Graph: 10B Nodes, 43B Relationships
◦ Multiple node and relationship properties
◦ Total time: 7h 41m
◦ Rate: ~1.9M objects/s
◦ And this was the rate from the Source to the Graph!
🍎
🍊
Problem 2: I have a Dataset and need a Graph!
• CSVs via neo4j-admin import
◦ Graph500 Scale31: 1B Nodes, 17B Relationships
◦ No properties
◦ Total time: 8h 15m
◦ Rate: ~0.6M objects/s
• BigQuery 🡪 Apache Beam + PyArrow 🡪 Neo4j (via GDS Flight Server)
◦ Twitter Event Graph: 10B Nodes, 43B Relationships
◦ Multiple properties
◦ Total time: 7h 41m
◦ Rate: ~1.9M objects/s
🍎
🍊
3x faster
10x the nodes
2x the relationships
Without creating and copying CSV files!
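The rates quoted above can be reproduced with a few lines of arithmetic. This sketch assumes "objects" means nodes plus relationships and uses only the counts and durations from the slide; the function name is illustrative.

```python
def objects_per_second(nodes, relationships, hours, minutes):
    """Return ingest throughput in graph objects (nodes + relationships) per second."""
    seconds = hours * 3600 + minutes * 60
    return (nodes + relationships) / seconds


# 🍎 CSVs via neo4j-admin import: 1B nodes, 17B relationships, 8h 15m
csv_rate = objects_per_second(1e9, 17e9, 8, 15)
# 🍊 BigQuery -> PyArrow -> Neo4j: 10B nodes, 43B relationships, 7h 41m
arrow_rate = objects_per_second(10e9, 43e9, 7, 41)

print(f"CSV import:   {csv_rate / 1e6:.1f}M objects/s")    # 0.6M objects/s
print(f"Arrow Flight: {arrow_rate / 1e6:.1f}M objects/s")  # 1.9M objects/s
print(f"Speedup:      {arrow_rate / csv_rate:.1f}x")       # 3.2x
```

The two runs used different datasets and hardware, so the ratio is an apples-to-oranges indication rather than a benchmark.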
What is Apache Arrow? 🏹
Arrow? Arrows? What?!
🏹 Apache Arrow 101
• 🛑 NOT related to https://arrows.app
• Apache project initially released in 2016
• Protocol for efficient binary representation of vectorized data
◦ Implemented primarily in C++
◦ Native implementations or bindings for Java, Python, .NET, Go, JS, Ruby, Rust, MATLAB, R…
• Design Objectives
◦ Data adjacency for sequential access (scans)
◦ O(1) (constant-time) random access
◦ SIMD and vectorization-friendly
◦ Relocatable without “pointer swizzling”, allowing for true zero-copy access in
shared memory
Example: Nullable int32 Vector
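The slide's diagram is not reproduced here, so the following is a hand-rolled sketch of how the Arrow columnar format lays out a nullable int32 vector: a validity bitmap (one bit per slot, least-significant bit first, 1 = valid) plus a contiguous buffer of 4-byte little-endian values. It is plain Python for illustration only, not PyArrow; the function names are made up, and real Arrow buffers are also padded and aligned, which is omitted.

```python
import struct


def encode_nullable_int32(values):
    """Encode ints/None into (validity_bitmap, data_buffer, null_count)."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1
            data += struct.pack("<i", 0)  # null slots still occupy 4 bytes
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += struct.pack("<i", value)
    return bytes(bitmap), bytes(data), null_count


def get_int32(bitmap, data, i):
    """Constant-time random access: one bit test plus one fixed-offset read."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, i * 4)[0]


bitmap, data, null_count = encode_nullable_int32([1, None, 2, 4, 8])
print(f"{bitmap[0]:08b}", len(data), null_count)  # 00011101 20 1
print(get_int32(bitmap, data, 1), get_int32(bitmap, data, 3))  # None 4
```

Because every slot has a fixed width, a scan walks one contiguous buffer and a lookup is a single offset calculation, which is what the design objectives above are after.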
Compare to Bolt
https://7687.org/bolt/bolt-protocol-message-specification-4.html#detail-message---record
Complex Example: Nullable List Vector
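As with the int32 case, the diagram is missing from this text, so here is a plain-Python sketch of the variable-size list layout in the Arrow columnar format: a validity bitmap, an offsets buffer with one more int32 entry than there are lists, and a single child buffer holding all elements back to back. The function names are illustrative; padding, alignment, and nulls inside the child values are left out for brevity.

```python
import struct


def encode_nullable_int32_lists(lists):
    """Encode a list of (list[int] | None) into (bitmap, offsets, child_values)."""
    bitmap = bytearray((len(lists) + 7) // 8)
    offsets = [0]
    child = []
    for i, item in enumerate(lists):
        if item is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # 1 = slot holds a list (maybe empty)
            child.extend(item)
        offsets.append(len(child))  # a null slot repeats the previous offset
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    child_buf = struct.pack(f"<{len(child)}i", *child)
    return bytes(bitmap), offsets_buf, child_buf


def get_list(bitmap, offsets_buf, child_buf, i):
    """Return list i, or None if the slot is null."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, i * 4)
    return list(struct.unpack(f"<{end - start}i", child_buf[start * 4:end * 4]))


bitmap, offsets_buf, child_buf = encode_nullable_int32_lists([[1, 2], None, [], [3]])
print(f"{bitmap[0]:08b}")                     # 00001101
print(struct.unpack("<5i", offsets_buf))      # (0, 2, 2, 2, 3)
print(get_list(bitmap, offsets_buf, child_buf, 0))  # [1, 2]
```

Note how the bitmap distinguishes a null list (slot 1) from an empty list (slot 2) even though both span zero child elements.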
🤔 Who’s using Apache Arrow?
• 🐼 Pandas
• Dask
• Graphistry
• MATLAB
• HuggingFace
• Databricks (Apache Spark)
• DataStax (Apache Cassandra)
• AWS Data Wrangler
• GCP BigQuery
• Dremio
• PySpark: IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark
• Parquet and C++: Reading data into Parquet from C++ at up to 4GB/s
• Pandas: Reading into Pandas up to 10GB/s
-- from https://www.dremio.com/resources/guides/apache-arrow/
✈️ Apache Arrow Flight
• 🔌 RPC & Streaming Framework built using Apache Arrow
◦ Maintained as part of the Apache Arrow project
◦ Optional dependency
• Utilizes gRPC…like the NOM Agent 🕵️
◦ Transport: HTTP/2
◦ Serialization: Protocol Buffers (aka protobufs)
• Arrow Flight : Arrow :: Apollo GraphQL : Neo4j
◦ An amazing scaffolding for rapidly building your data streaming service
Where can I find Arrow? 🏹
Or: how do I use the dang thing? 🤔
🌍 Ecosystem: Tools & Integrations
✔ Neo4j 🐍 GDS Python Client
• neo4j_arrow Python module
◦ 🚧 https://github.com/neo4j-field/neo4j_arrow
◦ Originally created for GraphConnect 2022 keynote demo
✔ Google Dataflow Flex Template
◦ Images for lift/shift of Parquet from GCS, tables in BigQuery
◦ Uses neo4j_arrow v0
◦ https://github.com/neo4j-field/dataflow-flex-pyarrow-to-gds
• Apache Hop – 🚧 Encode/Decode transforms
🚧 Apache Hop + Apache Arrow
Some Concrete Examples
Aside: How does the Keynote Demo work?
• “Three Easy Pieces!”
◦ Google BigQuery
◦ JupyterLab Python kernel
◦ Neo4j with GDS
• Parallel data streaming from front to back!
◦ All Apache Arrow vectors from end to end!
• Fanout is simple with the Python module
◦ No 3rd party libraries needed other than PyArrow & BigQuery client
◦ No need for multiple servers (though you could do it!)
How does the Keynote Demo work?
• Workers operate on BigQuery streams 🚿
[Diagram: Jupyter Python kernel, neo4j_bq, three Workers, Neo4j GDS]
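The fan-out pattern itself needs nothing beyond the standard library. The sketch below is a stand-in, not the demo's code: each "stream" is a list of fake batches and `fake_send` only counts rows, where a real worker would read a BigQuery stream and push Arrow batches to Neo4j. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(streams, send, max_workers=4):
    """Run send(stream) for each stream in parallel and return the total rows sent."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(send, streams))


def fake_send(stream):
    """Stand-in worker: pretend to ship each batch and report the rows handled."""
    return sum(row_count for _batch_name, row_count in stream)


# Eight hypothetical streams, each with three batches of 1,000 rows.
streams = [[("batch", 1000)] * 3 for _ in range(8)]
total_rows = fan_out(streams, fake_send)
print(total_rows)  # 24000
```

Threads are a reasonable fit when workers spend most of their time waiting on network I/O; swapping in processes or separate machines changes the executor, not the shape of the code.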
Working in Apache Spark
• Instead of the Neo4j Spark Connector…
◦ you can use Apache Arrow!
• 📋 Design Considerations
◦ Spark Workers should use an Arrow Flight Client to parallelize streams
◦ Spark is very row-oriented, so it is not a natural fit for some transforms
• 🔍 Let’s look at an example in Google Dataproc
Working in Apache Spark
• One example using the RDD Spark API
• Code
◦ The Notebook
“Embarrassingly Parallel” with Apache Beam
• 👷 Beam pipeline Workers use the Arrow Flight Client
• A Google Dataflow test with 68 Beam Workers and a lot of data:
◦ Nodes: 4B nodes (1 prop)
• 50 GiB* (476 Parquet files)
◦ Edges: 68B edges (3 props)
• 2.2 TiB* (50,001 Parquet files)
◦ Runtime: 3.5 hours
• 5.5M graph objects/s
* this was the compressed size using snappy compression!
[Chart: Neo4j CPU Utilization over Time]
“Embarrassingly Parallel” with Apache Beam
• PyArrow 💘 the Python Beam SDK
⚒️ Installing GDS Apache Arrow
😱 RTFM – Docs and the Power Switch
• 🚢 Ships in GDS 2.1 and newer
• 📃 All baked into the GDS docs:
◦ https://neo4j.com/docs/graph-data-science/current/installation/installation-apache-arrow/
• 🔛 Primary “power switch” for neo4j.conf
◦ gds.arrow.enabled=true
• Like other “connectors” (e.g. http, bolt), it has additional properties
🏹 Apache Arrow Flight
• The following are all under the gds.arrow property namespace
Setting                   | Default        | Description
listen_address            | localhost:8491 | Host/IP and port to bind to
advertised_listen_address | localhost:8491 | What the client tries to connect to
abortion_timeout          | 10             | Max idle timeout during an import
batch_size                | 10000          | Only tune this if you know what you’re doing 😉
☑️ Validating the Install
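The slide's content is not captured in this text, so as one possible first check, here is a small sketch that tests whether anything is listening on the Flight port. It assumes the default `localhost:8491` from the settings table and only proves TCP reachability; it does not confirm that the listener speaks Arrow Flight or that GDS is licensed. The function name is illustrative.

```python
import socket


def flight_port_reachable(host="localhost", port=8491, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling `flight_port_reachable()` right after restarting Neo4j with `gds.arrow.enabled=true` gives a quick signal before moving on to the client libraries.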
🧑‍🏭 Hands-on Exercises!
👕 Roll up those sleeves…if you have them.
🌄 Pre-Requisites before we Begin
• Neo4j Enterprise 4.4.x
• GDS >= 2.2.1
◦ GDS License
• Jupyter or Colab Notebook
◦ For “local” users with neo4j-field access, clone:
◦ https://github.com/neo4j-field/nodes-2022-arrow
https://docs.google.com/spreadsheets/d/1PGfRz7eHX_GAU5VwiUmmqalkXgBcbPHi8ymM6d8uZWs/edit?usp=sharing
🏋️ The Exercises
1. GDS Client – Projecting a Graph directly from Parquet files
◦ 📥 Getting Data In
2. GDS Client – Streaming Graph Features at Scale
◦ 📤 Getting Data Out
3. Neo4j Arrow Client – Projecting a Graph
◦ 🏹 Getting Data In via PyArrow
4. Neo4j Arrow Client – Importing a new Database (only allowed locally)
◦ 🗃️ Building a Database via PyArrow
Ultimately…
• 🙈 Neither YOU, nor the USER,
should know Arrow exists!
◦ It’s like Bolt…it should disappear
into the background.
◦ You shouldn’t notice it.
• 😀 They should just be happy
with Neo4j’s data integration
performance.
And be confident! 🦸
• You can build Projections of ✨BILLIONS✨ of Nodes and Relationships
• You can build Databases of ✨BILLIONS✨ of Nodes and Relationships
• And you can do the above…
◦ Without filesystem access
◦ Without CSV nonsense
◦ With elastic, scale-out concurrency (BYO-Orchestration 😉)
~~ fin ~~
Please be kind and rewind.

Road to NODES - Blazing Fast Ingest with Apache Arrow
