Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015
Neo Technology, Inc Confidential
#neo4j
Chicago Crime dataset
The goal
Chicago Crime CSV file → imported into Neo4j
Exploring the data
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
RETURN row
LIMIT 1
Sketch a rough initial model
Import a sample: Crimes
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (crime:Crime {
id: row.ID,
description: row.Description,
caseNumber: row.`Case Number`,
arrest: row.Arrest,
domestic: row.Domestic});
Import a sample: Crimes
Show how to do this better by splitting up the attributes
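One way to do that split (a sketch: MERGE on the id alone so the node lookup stays cheap, then set the remaining properties in a separate query):

```cypher
// Pass 1: create crime nodes keyed on id only
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (crime:Crime {id: row.ID})
ON CREATE SET crime.description = row.Description,
              crime.caseNumber  = row.`Case Number`;

// Pass 2: a separate query for the remaining attributes
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MATCH (crime:Crime {id: row.ID})
SET crime.arrest = row.Arrest, crime.domestic = row.Domestic;
```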
Import a sample: Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (:CrimeType {
name: row.`Primary Type`});
Import a sample: Crimes -> Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MATCH (crime:Crime {
id: row.ID,
description: row.Description})
MATCH (crimeType:CrimeType {
name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);
Add indexes
CREATE INDEX ON :Label(property)
Add indexes
CREATE INDEX ON :Label(property)
CREATE INDEX ON :Crime(id);
CREATE INDEX ON :CrimeType(name);
CREATE INDEX ON :Location(name);
...
Periodic Commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
MERGE (crime:Crime {
id: row.ID,
description: row.Description})
Periodic Commit
• Neo4j keeps all transaction state in memory
which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the
transaction after a certain number of rows
• Default is 1000 rows but it’s configurable
• Currently only works with LOAD CSV
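The batch size can also be set explicitly. A small sketch, reusing the same file as the other queries in this deck:

```cypher
// Flush the transaction every 500 rows instead of the default 1000
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
MERGE (crime:Crime {id: row.ID});
```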
Avoiding the Eager
• Cypher has an Eager operator which will bring
forward parts of a query to ensure safety
• We don’t want to see this operator when we’re
importing data – it will slow things down a lot
• Put a diagram of eager => slow (maybe a query plan?)
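One way to check for it: prefix the import query with PROFILE and look for Eager in the plan. A hedged sketch (combining node and relationship MERGEs in a single query like this is the kind of shape that can trigger Eager; splitting it into separate passes usually removes it):

```cypher
PROFILE
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (crime:Crime {id: row.ID})
MERGE (crimeType:CrimeType {name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);
```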
LOAD CSV in summary
• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium size data (up to 10M rows)
Bulk loading an initial data set
• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used for large initial data sets
• Skips the transactional layer of Neo4j and writes
store files directly
Expects files in a certain format
Nodes: :ID(Crime) :LABEL description (and :ID(Beat) :LABEL)
Relationships: :START_ID(Crime) :END_ID(Beat) :TYPE
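To make those headers concrete, a hedged illustration of what two of the files might contain; the record values and the ON_BEAT relationship type are made up for illustration:

```
tmp/crimes.csv:
:ID(Crime),:LABEL,description
9955810,Crime,THEFT

tmp/crimesBeats.csv:
:START_ID(Crime),:END_ID(Beat),:TYPE
9955810,0313,ON_BEAT
```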
What we have…
Translation Phase required
Chicago Crime CSV file → Translation Phase → Neo4j-ready CSV files
Spark all the things
Chicago Crime CSV file → processed by Spark Job → spits out Neo4j-ready CSV files → imported into Neo4j
The Spark Job
Submitting the Spark Job
./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys  0m24.267s
The generated files
$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv
Importing into Neo4j
DATA=tmp
NEO=./neo4j-enterprise-2.2.3
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace
IMPORT DONE in 36s 208ms
Enriching the crime graph
2 options
• Translate the JSON to CSV (e.g. with jq) and import it with LOAD CSV
• Pass the JSON straight in via a language driver and the HTTP API
Using py2neo to load JSON into Neo4j
import json
from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# load the categories document; avoid shadowing the json module
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

print(graph.cypher.execute(query, json=document))
Enriching the crime graph
Translate from JSON to CSV
Enriching the crime graph
Import using LOAD CSV
Updating the graph
• As new crimes come in we want to update the
graph to take them into account
Updating the graph
• Import this using REST Transactional API
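A minimal sketch of that update against the Neo4j 2.x transactional HTTP endpoint (POST to /db/data/transaction/commit). The crime values are invented and no request is actually sent here; the snippet only shows how the payload could be built:

```python
import json

# Hypothetical incoming crime record (field names mirror the CSV columns)
new_crime = {"id": "10000001", "description": "THEFT"}

# One Cypher statement per request; {id} and {description} are parameters
payload = {
    "statements": [{
        "statement": ("MERGE (c:Crime {id: {id}}) "
                      "ON CREATE SET c.description = {description}"),
        "parameters": new_crime
    }]
}

# This JSON body would be POSTed to
# http://localhost:7474/db/data/transaction/commit
body = json.dumps(payload)
print(body)
```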
This talk brought to you by…
And that’s it…


Editor's Notes

  • #3 We’re going to look at how we’d go about importing the Chicago Crime open dataset
  • #4 Available as a CSV dump much like a Hadoop or relational database dump with lots of different records on one row
  • #5 The goal is to get this into Neo4j and then make some money from having that data imported
  • #6 Now before we do anything it’s time to do a bit of exploration of the data so we know what we’re dealing with. We could choose to do that using command line tools like grep, awk and so on but we could also use Neo4j’s LOAD CSV command.
  • #7 Let’s have a look at one row to see what we’ve got in the data set.
  • #8 These are the ones that I’d probably extract to create a graph. The whole record is a ‘crime’ but then we’ve got some other latent entities that we can reify.
  • #11 Let’s start by importing 100 rows so that we can iterate really quickly and get a feel for the model that we’ve come up with
  • #14 Mention that it’s best to separate your queries so you don’t have to do multiple MERGEs in the same query – it’s fine for playing around but when you need speed, split them up
  • #15 Don’t forget to add indexes or this is going to be incredibly slow
  • #16 For us we might add an index for each of our main types of entity so we can easily look them up later
  • #17 This will flush the transaction every 1000 rows by default.
  • #18 Here’s some more information about periodic commit
  • #20 Ok so if we’re to summarise LOAD CSV this is what we’ve got
  • #21 This is a new tool introduced in Neo4j 2.2. It’s super fast but doesn’t give you the transactional guarantees of normal Neo4j. Effectively we’re building an offline copy of the database
  • #22 This is the format of the files that the import tool expects
  • #24 We need to translate those files to something more suitable
  • #25 We need to translate those files to something more suitable. We could write a program to do that but Spark provides quite a nice API for doing this.
  • #32 We can get the categories from the Chicago Police website and enrich our graph with a little hierarchy of crimes.
  • #33 Now that we’ve got this as JSON we’re going to import it into the graph
  • #34 Now that we’ve got this as JSON we’re going to import it into the graph
  • #35 We can get the categories from the Chicago Police website
  • #41 So it probably goes without saying that the Chicago tourism board sponsored this talk