Road to NODES - Blazing Fast Ingest with Apache Arrow

© 2022 Neo4j, Inc. All rights reserved.
🏹 Blazing Fast Ingestion with Apache Arrow
Dave Voutila
Sales Engineering Manager, US East

👋 Hello!
• Sales Engineering Manager @ Neo4j
• Based in Vermont, USA 🍁
◦ Usually chasing our 2 Golden Retrievers 🐕🐕
• Online
◦ 👨‍💻
◦ 👨‍💼
• Outside Neo4j
◦ 🐡 (hypervisor hacker)
◦ 🐤 on “the Twitters”

📝 Agenda
• Dispelling Myths
• Defining the Problem Statements
• Reviewing Success Stories
• Showing where Arrow fits in the Solution

Ok, so what then? 🤔
Or: What problems does Apache Arrow solve? 🕵️

Who’s this all for?
• We talk about Developers & Data Scientists…
From https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b

Who’s this all for?
• But what about the Data & ML Engineers?
From https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b

Problem 1: Reading Graph Features at scale is “hard”
1. 👷 Build a modestly sized graph (millions of nodes & relationships)
2. 🎆 Generate graph features
◦ Scalars
• Community IDs – fraud rings?
• Centrality Scores – influence scores?
◦ Vectors
• Node embeddings – input for ML
◦ Tuples
• Similarity tuples (node1, node2, relationship type, score) – updates to KGraph
• K-Shortest paths
3. ✈️ 🏖️ Book yourself a nice holiday while you wait for your Cypher query to retrieve the billions of rows of data

How Users Measure Time
• 🚀 Can I use it in a REPL?
• 🍵 Should I make a coffee or tea?
• 🍕 Should I go to lunch now?
• 😴 Should I go home for the night?
• ✈️ Should I go home for the weekend?
Scale: “Real-Time” 😃 | Batch 😐 | Gonna Start Looking at Jobs on LinkedIn 😭😭😭

Where are you Today?
Scale: “Real-Time” 😃 | Batch 😐 | Jobs on LinkedIn 😭😭😭

Problem 1: Reading Graph Features at scale is hard
Neo4j Python GDS Client w/ Arrow

Problem 2: Writing a Dataset to a Graph at scale is hard
• Neo4j Data Importer exists for a very specific reason!
◦ But how do you scale it?
◦ How do you move beyond CSV-in-the-Browser-via-Neo4j-Desktop?
• 🧑‍🔬 As Data Scientists, we…
◦ Have no file system access (sorry, neo4j-admin import and Aura!)
◦ Have large datasets…too large for drag & drop CSV
◦ Need to iterate while we experiment
• 👷 As Data Engineers, we…
◦ Need to integrate with a variety of “Data Lake” and Cloud sources
◦ Probably aren’t Neo4j & Cypher experts (yet).

• CSVs via neo4j-admin import
◦ Graph500 Scale31: 1B Nodes, 17B Relationships
◦ No properties
◦ Total time: 8h 15m
◦ Rate: ~0.6M objects/s
🍎
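The rate on this slide follows from the object count and the wall-clock time. A quick sanity check, using only the figures quoted above (the variable names are illustrative):

```python
# Figures quoted on the slide for the neo4j-admin import run.
nodes = 1_000_000_000            # 1B nodes
relationships = 17_000_000_000   # 17B relationships
seconds = 8 * 3600 + 15 * 60     # 8h 15m of wall-clock time

# "Objects" here means nodes plus relationships.
objects_per_second = (nodes + relationships) / seconds
rate_millions = objects_per_second / 1_000_000
print(f"{rate_millions:.2f}M objects/s")  # prints: 0.61M objects/s
```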

• BigQuery ➡️ PyArrow ➡️ Neo4j DB (via GDS Flight Server)
◦ Social Media Event Graph: 10B Nodes, 43B Relationships
◦ Multiple node and relationship properties
◦ And this was the rate from the Source to the Graph!
🍎
🍊

Problem 2: I have a Dataset and need a Graph!
• BigQuery 🡪 Apache Beam + PyArrow 🡪 Neo4j (via GDS Flight Server)
◦ Twitter Event Graph: 10B Nodes, 43B Relationships
◦ Multiple properties
🍎
🍊
3x faster
10x the nodes
2x the relationships
Without creating and copying CSV files!

What is Apache Arrow? 🏹
Arrow? Arrows? What?!

🏹 Apache Arrow 101
• 🛑 NOT related to https://arrows.app
• Apache project initially released ~5 years ago
• Protocol for efficient binary representation of vectorized data
◦ Implemented primarily in C++
◦ Native implementations and bindings for Java, Python, .NET, Go, JS, Ruby, Rust, MATLAB, R…
• Design Objectives
◦ Data adjacency for sequential access (scans)
◦ O(1) (constant-time) random access
◦ SIMD and vectorization-friendly
◦ Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory

Example: Nullable int32 Vector
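The slide's diagram did not survive extraction, so here is a minimal sketch of the same idea in plain Python: a nullable int32 vector is a validity bitmap plus a fixed-width data buffer. The helper names are hypothetical, and real Arrow buffers are also padded and aligned, which this sketch omits.

```python
import struct

def encode_nullable_int32(values):
    """Return (validity_bitmap, data_buffer, null_count) for a list of int/None."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant-bit numbering
    # Null slots still occupy 4 bytes; their content is unspecified (0 here).
    data = struct.pack(f"<{len(values)}i", *(0 if v is None else v for v in values))
    null_count = sum(v is None for v in values)
    return bytes(bitmap), data, null_count

def get(bitmap, data, i):
    """Constant-time random access: one bit test plus one fixed-offset read."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, i * 4)[0]

bitmap, data, null_count = encode_nullable_int32([1, None, 2, 4, 8])
print(f"{bitmap[0]:08b}", len(data), null_count)  # prints: 00011101 20 1
```

Because every slot has the same width, a scan walks one contiguous buffer and a lookup is simple offset arithmetic, which is what the design objectives above describe.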

Compare to Bolt
https://7687.org/bolt/bolt-protocol-message-specification-4.html#detail-message---record

Complex Example: Nullable List Vector
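A variable-length list vector adds an offsets buffer on top of the validity bitmap. The sketch below is plain Python with hypothetical helper names. It keeps the child values as a Python list for brevity, whereas in Arrow the child is itself an array with its own buffers.

```python
import struct

def encode_nullable_int_lists(lists):
    """Return (validity_bitmap, offsets_buffer, child_values) for list-or-None items."""
    bitmap = bytearray((len(lists) + 7) // 8)
    offsets = [0]
    child = []
    for i, item in enumerate(lists):
        if item is not None:
            bitmap[i // 8] |= 1 << (i % 8)
            child.extend(item)
        offsets.append(len(child))  # a null or empty slot repeats the previous offset
    return bytes(bitmap), struct.pack(f"<{len(offsets)}i", *offsets), child

def get_list(bitmap, offsets_buf, child, i):
    """Slot i spans child[offsets[i]:offsets[i + 1]] unless its validity bit is unset."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, i * 4)
    return child[start:end]

bitmap, offsets_buf, child = encode_nullable_int_lists(
    [[12, -7, 25], None, [0, -127, 127, 50], []]
)
print(f"{bitmap[0]:08b}", struct.unpack("<5i", offsets_buf))
```

Note that a null list and an empty list have identical offsets; only the validity bit tells them apart.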

🤔 Who’s using Apache Arrow?
• 🐼 Pandas
• Dask
• Graphistry
• MATLAB
• HuggingFace
• Databricks (Apache Spark)
• DataStax (Apache Cassandra)
• AWS Data Wrangler
• GCP BigQuery
• Dremio
• PySpark: IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark
• Parquet and C++: Reading data into Parquet from C++ at up to 4GB/s
• Pandas: Reading into Pandas up to 10GB/s
-- from https://www.dremio.com/resources/guides/apache-arrow/

✈️ Apache Arrow Flight
• 🔌 RPC & Streaming Framework built using Apache Arrow
◦ Maintained as part of the Apache Arrow project
◦ Optional dependency
• Utilizes gRPC…like the NOM Agent 🕵️
◦ Transport: HTTP/2
◦ Serialization: Protocol Buffers (aka protobufs)
• Arrow Flight : Arrow :: Apollo GraphQL : Neo4j
◦ An amazing scaffolding for rapidly building your data streaming service

Where can I find Arrow? 🏹
Or: how do I use the dang thing? 🤔

🌍 Ecosystem: Tools & Integrations
✔ Neo4j 🐍 GDS Python Client

• neo4j_arrow Python module
◦ 🚧 https://github.com/neo4j-field/neo4j_arrow
◦ Originally created for GraphConnect 2022 keynote demo

✔ Google Dataflow Flex Template
◦ Images for lift/shift of Parquet from GCS, tables in BigQuery
◦ Uses neo4j_arrow v0
◦ https://github.com/neo4j-field/dataflow-flex-pyarrow-to-gds
• Apache Hop – 🚧 Encode/Decode transforms

🚧 Apache Hop + Apache Arrow

Some Concrete Examples

Aside: How does the Keynote Demo work?
• “Three Easy Pieces!”
◦ Google BigQuery
◦ JupyterLab Python kernel
◦ Neo4j with GDS
• Parallel data streaming from front to back!
◦ All Apache Arrow vectors from end to end!
• Fanout is simple with the Python module
◦ No 3rd party libraries needed other than PyArrow & BigQuery client
◦ No need for multiple servers (though you could do it!)
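The fan-out idea can be sketched with nothing but the standard library. This is an assumption about the general shape, not the demo's actual code: `upload_stream` is a hypothetical stand-in for "read one source stream and send it over Arrow Flight", and the use of threads is a choice made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def upload_stream(stream_id, rows):
    """Hypothetical stand-in for reading one source stream and sending it on."""
    # A real worker would turn rows into Arrow record batches and stream them out.
    return len(rows)

def fan_out(streams, max_workers=4):
    """Upload every (stream_id, rows) pair concurrently; return total rows sent."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(lambda s: upload_stream(*s), streams)
        return sum(counts)

# Three fake source streams with 10, 20 and 30 rows.
streams = [(f"stream-{i}", list(range(i * 10))) for i in range(1, 4)]
total = fan_out(streams)
print(total)  # prints: 60
```

Each stream is independent, so the same pattern scales from threads on one machine to workers on many.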

How does the Keynote Demo work?
• Workers operate on BigQuery streams 🚿
[Diagram: Jupyter Python kernel, neo4j_bq, three Workers, Neo4j GDS]

Working in Apache Spark
• Instead of the Neo4j Spark Connector…
◦ you can use Apache Arrow!
• 📋 Design Considerations
◦ Spark Workers should use an Arrow Flight Client to parallelize streams
◦ Spark is very row-oriented, so it is not a natural fit for some transforms
• 🔍 Let’s look at an example in Google DataProc

Working in Apache Spark
• One example using the RDD Spark API
• Code
◦
The Notebook

“Embarrassingly Parallel” with Apache Beam
• 👷 Beam pipeline Workers use the Arrow Flight Client
• A Google Dataflow test with 68 Beam Workers and a lot of data:
◦ Nodes: 4B nodes (1 prop)
• 50 GiB* (476 Parquet files)
◦ Edges: 68B edges (3 props)
• 2.2 TiB* (50,001 Parquet files)
◦ Runtime: 3.5 hours
• 5.5M graph objects/s
* this was the compressed size using snappy compression!
Neo4j CPU Utilization over Time

“Embarrassingly Parallel” with Apache Beam
• PyArrow 💘 the Python Beam SDK

⚒️ Installing GDS Apache Arrow

😱 RTFM – Docs and the Power Switch
• 🚢 Ships in GDS 2.1 and newer
• 📃 All baked into the GDS docs:
◦ https://neo4j.com/docs/graph-data-science/current/installation/installation-apache-arrow/
• 🔛 Primary “power switch” for neo4j.conf
◦ gds.arrow.enabled=true
• Like other “connectors” (e.g. http, bolt), it has additional properties

🏹 Apache Arrow Flight
• The following are all under the gds.arrow property namespace
Setting                     Default          Description
listen_address              localhost:8491   Host/IP and port to bind to
advertised_listen_address   localhost:8491   What the client tries to connect to
abortion_timeout            10               Max idle timeout during an import
batch_size                  10000            Only tune this if you know what you’re doing 😉
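Put together, the settings above become `gds.arrow.*` lines in neo4j.conf. The sketch below just renders such lines; the `0.0.0.0` bind address and the `neo4j.example.com` hostname are illustrative assumptions for a server that remote clients must reach, not recommended values.

```python
# Setting names come from the table above, under the gds.arrow namespace.
# The address values are placeholders chosen for illustration.
settings = {
    "enabled": "true",
    "listen_address": "0.0.0.0:8491",
    "advertised_listen_address": "neo4j.example.com:8491",
}

conf_lines = [f"gds.arrow.{key}={value}" for key, value in settings.items()]
print("\n".join(conf_lines))
```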

☑️ Validating the Install

🧑‍🏭 Hands-on Exercises!
👕 Roll up those sleeves…if you have them.

🌄 Pre-Requisites before we Begin
• Neo4j Enterprise 4.4.x
• GDS >= 2.2.1
◦ GDS License
• Jupyter or Colab Notebook
◦ For “local” users with neo4j-field access, clone:
◦ https://github.com/neo4j-field/nodes-2022-arrow
https://docs.google.com/spreadsheets/d/1PGfRz7eHX_GAU5VwiUmmqalkXgBcbPHi8ymM6d8uZWs/edit?usp=sharing

🏋️ The Exercises
1. GDS Client – Projecting a Graph directly from Parquet files
◦ 📥 Getting Data In
2. GDS Client – Streaming Graph Features at Scale
◦ 📤 Getting Data Out
3. Neo4j Arrow Client – Projecting a Graph
◦ 🏹 Getting Data In via PyArrow
4. Neo4j Arrow Client – Importing a new Database (only allowed locally)
◦ 🗃️ Building a Database via PyArrow
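As a rough preview of exercises 1 and 2, the sketch below wraps the two GDS Python client calls involved: projecting a graph from DataFrames and streaming node properties back. It assumes a recent `graphdatascience` client, where these calls are `gds.graph.construct` and `gds.graph.nodeProperties.stream`; older client versions may name the streaming call differently. The wrapper `project_and_stream` is a hypothetical helper, not part of the client.

```python
def project_and_stream(gds, graph_name, nodes, relationships, properties):
    """Project an in-memory graph from DataFrames, then stream node properties back.

    Assumes `gds` is a graphdatascience.GraphDataScience client. `nodes` needs a
    `nodeId` column (other columns become node properties); `relationships`
    needs `sourceNodeId` and `targetNodeId` columns.
    """
    G = gds.graph.construct(graph_name, nodes, relationships)
    return gds.graph.nodeProperties.stream(G, properties)


# Usage sketch (needs a running Neo4j with GDS and Arrow enabled; the URI,
# credentials and file names are placeholders):
#
#   import pandas as pd
#   from graphdatascience import GraphDataScience
#   gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"), arrow=True)
#   df = project_and_stream(gds, "demo", pd.read_parquet("nodes.parquet"),
#                           pd.read_parquet("rels.parquet"), ["score"])
```

When Arrow is enabled on the server, the client uses it for these transfers behind the scenes, so the calling code looks the same either way.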

Ultimately…
• 🙈 Neither YOU, nor the USER, should know Arrow exists!
◦ It’s like Bolt…it should disappear into the background.
◦ You shouldn’t notice it.
• 😀 They should just be happy with Neo4j’s data integration performance.

And be confident! 🦸
• You can build Projections of ✨BILLIONS✨ of Nodes and Relationships
• You can build Databases of ✨BILLIONS✨ of Nodes and Relationships
• And you can do the above…
◦ Without filesystem access
◦ Without CSV nonsense
◦ With elastic, scale-out concurrency (BYO-Orchestration 😉)

~~ fin ~~
Please be kind and rewind.
