Wanderu: Lessons Learned
Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j
Eddy Wong
CTO, Wanderu.com
@eddywongch
About Wanderu.com
Search Engine for (Intercity) Buses and Trains
Demo
From pt A to pt B
Nomenclature: Stations, Trips
Stations: Boston (A), NYC, DC (B)
Trips: Amtrak, $101, 09/26/2013; Bolt, $25, 09/26/2013; Mega, $24, 09/26/2013
From pt A to pt B
A: Cambridge, MA
B: Brooklyn, NY
Stations: South Station, Boston; 31st & 9th Ave, NYC; 28th & 7th Ave, NYC; 34th & 8th Ave, NYC
Our Story
• Tech Started about 1+ yr ago
• Beta in Mar, Launch in Aug
• Knew nothing about Neo4j when we
started (Jun 2012)
• Did not like the relational model: wanted
schema-less and no self-joins
• Wanted a graph model
Relational vs. Graph
Lessons: Learned, UnLearned
Idea
• Architectural
• Modeling
• Geo
Architectural Lessons
Art: MC Escher
Our Story
• Started with MongoDB as a general store:
easy to manipulate and organize data
• Wanted a db that could preserve the
Graph Model
• Debated: Document vs. Graph
• Could not find one single db that could do
both: general store + graph
Workflow
[Diagram labels: Bus Websites, Scraping, Non-uniform Data, JSON, Uniform Data, Store, Server]
noSQL
• You need to make a choice of one noSQL
database
• You need ONE (centralized) database
• The word “database” is a loaded term
• Lots of (very different) noSQL db options
Our Situation
• Data is written only in one direction
• Users search for paths, then segments
• Searches are done by date
• Needed online capability
• Trip info (price/avail) could change on some
Our Solution
• Use Both: MongoDB + Neo4j
• “Docugraph” = Document + Graph
• Syncing two kinds of databases
• Eventual consistency
Pipeline
[Diagram labels: Bus Websites, Scraping, Non-uniform Data, JSON, Uniform Data, MongoDB, Mongo Connector (Replica Mechanism), Neo4j (Nodes & Edges)]
MongoConnector
• MongoDB Labs project, open source, unsupported
• Uses Replica Mechanism: Oplog
• Eventually Consistent (not real time)
• Written in Python
• Main methods: Upserts and Deletes, passes doc
• Implement DocMgr->Neo4jDocMgr->py2neo
• Other impls: MongoDocMgr, SolrDocMgr,
ESDocMgr
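The doc manager idea above (receive upserts and deletes, forward them to the target store) can be sketched as follows. This is a minimal illustration, not the Mongo Connector interface: `GraphDocManager` and `InMemoryGraph` are hypothetical names, and the in-memory graph stands in for Neo4j/py2neo so the example runs on its own.

```python
class InMemoryGraph:
    """Hypothetical stand-in for a graph store (not the py2neo API)."""

    def __init__(self):
        self.nodes = {}   # station id -> properties
        self.edges = {}   # (depart_id, arrive_id) -> properties

    def merge_node(self, key):
        # Create the node only if it does not exist yet.
        self.nodes.setdefault(key, {"id": key})

    def merge_edge(self, depart_id, arrive_id, props):
        self.merge_node(depart_id)
        self.merge_node(arrive_id)
        self.edges[(depart_id, arrive_id)] = props

    def delete_edge(self, depart_id, arrive_id):
        self.edges.pop((depart_id, arrive_id), None)


class GraphDocManager:
    """Hypothetical doc manager: turns trip documents into graph edges."""

    def __init__(self, graph):
        self.graph = graph

    def upsert(self, doc):
        # Forward only the keys plus one light property to the graph.
        self.graph.merge_edge(doc["depart_id"], doc["arrive_id"],
                              {"carrier": doc.get("carrier")})

    def remove(self, doc):
        self.graph.delete_edge(doc["depart_id"], doc["arrive_id"])


graph = InMemoryGraph()
manager = GraphDocManager(graph)
manager.upsert({"depart_id": "BOS", "arrive_id": "NYC",
                "carrier": "Megabus", "price": 24.0})
manager.upsert({"depart_id": "NYC", "arrive_id": "DC", "carrier": "Megabus"})
manager.remove({"depart_id": "NYC", "arrive_id": "DC"})
```

After these calls the graph holds three station nodes and a single BOS to NYC edge; removing a trip drops the edge but leaves its stations in place.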
Populating Neo4j (2)
• Created our own way of creating Edges
• Auto Node creation when Edge is created:
Could add Stations (nodes) on the fly
• py2neo requires 2 “node ref”s to create an
edge, i.e., might need two round trips to
Neo4j
Edge Creator P-code
hashtable allStations = load_stations()
w_create_edge(station_id a, station_id b, otherdata):
    look up a in allStations
    if found     -> ref_a = allStations.get(a)
    if not found -> ref_a = py2neo.create_node(a)
                    add a (with ref_a) to allStations
    ... (same lookup for b, giving ref_b)
    py2neo.create_edge(ref_a, ref_b, ...)
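The pseudocode above can be made runnable with a stand-in for the database client. The sketch below assumes a hypothetical `FakeGraphClient` whose `create_node` and `create_edge` methods mirror the pseudocode rather than any specific py2neo release; it counts round trips so the effect of caching node refs is visible.

```python
class FakeGraphClient:
    """Hypothetical client; each method call counts as one round trip."""

    def __init__(self):
        self.round_trips = 0
        self.nodes = []
        self.edges = []

    def create_node(self, station_id):
        self.round_trips += 1
        ref = len(self.nodes)          # node ref = position in the list
        self.nodes.append(station_id)
        return ref

    def create_edge(self, ref_a, ref_b, **props):
        self.round_trips += 1
        self.edges.append((ref_a, ref_b, props))


class EdgeCreator:
    def __init__(self, client, known_stations=None):
        self.client = client
        # station id -> node ref, loaded once and kept in memory
        self.all_stations = dict(known_stations or {})

    def _node_ref(self, station_id):
        ref = self.all_stations.get(station_id)
        if ref is None:
            # Station not seen before: create it on the fly and cache the ref.
            ref = self.client.create_node(station_id)
            self.all_stations[station_id] = ref
        return ref

    def create_edge(self, a, b, **props):
        ref_a = self._node_ref(a)
        ref_b = self._node_ref(b)
        self.client.create_edge(ref_a, ref_b, **props)


client = FakeGraphClient()
creator = EdgeCreator(client)
creator.create_edge("BOS", "NYC", carrier="Megabus")  # 2 node creates + 1 edge
creator.create_edge("BOS", "DC", carrier="Amtrak")    # 1 node create + 1 edge
creator.create_edge("NYC", "DC", carrier="Bolt")      # 0 node creates + 1 edge
```

Once every station is in the cache, each new edge costs a single call, which is the point of keeping the station table in memory.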
Pipeline
[Diagram labels: Bus Websites, Scraping, Non-uniform Data, JSON, MongoDB, Mongo Connector (Replica Mechanism), Neo4j (Nodes & Edges), REST Server; example station pairs: (BOS, NYC), (BOS, PHL), (NYC, DC), (NYC, PHL)]
Modeling Lessons
Art: MC Escher
Our Story
• We tried to “dump” all data into Neo4j
• Stations -> Nodes, Trips -> Edges
• Problem: Edges had dates -> too many
Edges -> “Super Node”
• Query perf was terrible (1+ mins) and got
worse as # edges increased
Our Story (2)
• Went from Cypher to Gremlin, thinking
that would improve performance
• Needed range queries on Edges
Our Solution
• Don’t store everything in Neo4j, only
metadata
• Use Neo4j as an index
• Don’t store entities in Nodes, only keys
• Don’t store heavy properties in Edges
Neo4j Model
Source: Tobias Lindaaker, Wes Freeman
Neo4j Runtime Model
• Relationships are in a linked list
• Properties are in a linked list
• Therefore: There is NO random access for
Relationships or Properties
• A range query of relationships requires a
full scan
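To see why a range query over relationships degrades as edges pile up on one node, here is a toy model of a relationship chain. It is not Neo4j's actual storage code; the `Rel` class and the `departures_between` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rel:
    """Toy relationship record: one trip edge with a departure day."""
    depart_day: int
    next: Optional["Rel"] = None


def build_chain(days):
    """Link relationship records in the order given; return the head."""
    head = None
    for day in reversed(days):
        head = Rel(day, head)
    return head


def departures_between(head, lo, hi):
    """Range query by walking the chain; returns (matches, records visited)."""
    matches, visited = [], 0
    rel = head
    while rel is not None:
        visited += 1
        if lo <= rel.depart_day <= hi:
            matches.append(rel.depart_day)
        rel = rel.next
    return matches, visited


# One "super node" with 10,000 dated edges, of which only 3 are wanted.
chain = build_chain(range(10_000))
matches, visited = departures_between(chain, 100, 102)
```

Every record in the chain is visited even though only three match, so the cost grows with the total number of edges rather than with the size of the answer.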
Our Solution (2)
• Needed ability to do range queries on
Edges
• Serve paths from Neo4j, segments from
MongoDB
• The one thing we tried to avoid we ended
up doing: Joins
• Came up with “Docugraph” approach
Docugraph
• MongoDB Collections for Nodes and Edges
• Neo4j: Only keys for nodes
• Neo4j: Only Properties relevant for queries
Nodes & Edges
• Collection for Stations (nodes)
{id: "BOS", name: "Boston South Station", address: "Summer St", ...}
• Collection for Trips (edges)
{depart_id: "BOS", arrive_id: "NYC", carrier: "Megabus", price: 24.0, ...}
Modeling
• Storing info in two or more dbs
• Doing a “join” across multiple dbs
Joins across DBs
MongoDB: Stations    Neo4j: Nodes
BOS                  BOS
NYC                  NYC
DC                   DC
...                  ...
MongoDB: Trips       Neo4j: Edges
BOS-NYC              BOS-NYC
BOS-DC               BOS-DC
NYC-DC               NYC-DC
...                  ...
• Forget seq id generated by dbs
• Use a human-created long string for id
• Convert pair into id: depart-arrive
• For example: BOS-NYC
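A small sketch of the shared-key convention, assuming a hypothetical `edge_id` helper. Plain Python containers stand in for the MongoDB Trips collection and the edges held in Neo4j.

```python
def edge_id(depart_id: str, arrive_id: str) -> str:
    """Build the shared key for a trip edge, e.g. 'BOS-NYC'."""
    return f"{depart_id}-{arrive_id}"


# Stand-in for the MongoDB Trips collection, keyed by the shared id.
trips_by_edge = {
    "BOS-NYC": [{"carrier": "Megabus", "price": 24.0}],
}

# Stand-in for the edges held in Neo4j: keys only, no heavy properties.
graph_edges = [("BOS", "NYC")]

# The "join": take an edge from the graph side, look up documents on the other.
joined = {
    edge_id(a, b): trips_by_edge.get(edge_id(a, b), [])
    for a, b in graph_edges
}
```

Because the key is derived from the station pair rather than generated by either database, both sides can compute it independently and no id mapping table is needed.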
Indexing Technique
• Index Trips by {origin-dest, datetime}
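An index on {origin-dest, datetime} lets a query jump straight to one route and one date window. The pure-Python sketch below shows the idea with a sorted list and `bisect`; in MongoDB this would presumably be a compound index on the two fields, but the `TripIndex` class, the key layout, and the sample trips are assumptions made for illustration.

```python
import bisect
from datetime import datetime


class TripIndex:
    """Sorted (edge_id, departure) keys acting as a compound index."""

    def __init__(self):
        self._keys = []    # sorted (edge_id, departure) pairs
        self._trips = []   # trip documents, parallel to _keys

    def add(self, edge_id, departure, trip):
        key = (edge_id, departure)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._trips.insert(pos, trip)

    def between(self, edge_id, start, end):
        """Trips on one edge with start <= departure < end."""
        lo = bisect.bisect_left(self._keys, (edge_id, start))
        hi = bisect.bisect_left(self._keys, (edge_id, end))
        return self._trips[lo:hi]


# Illustrative trips only; times and prices are made up for the example.
index = TripIndex()
index.add("BOS-NYC", datetime(2013, 9, 26, 8, 0), {"carrier": "Megabus", "price": 24.0})
index.add("BOS-NYC", datetime(2013, 9, 27, 8, 0), {"carrier": "Megabus", "price": 24.0})
index.add("NYC-DC", datetime(2013, 9, 26, 13, 0), {"carrier": "Bolt", "price": 25.0})

same_day = index.between("BOS-NYC", datetime(2013, 9, 26), datetime(2013, 9, 27))
```

The lookup touches only the slice of keys for the requested edge and day, in contrast to walking every relationship on a station node.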
Querying
• REST API in node.js
• Assemble results from two sources
• Paths from Neo4j
• Segments from MongoDB
• Sort by price, duration
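The query flow above can be sketched end to end. The talk's API was written in node.js; this version uses Python to stay consistent with the other examples. An adjacency dict stands in for Neo4j, a dict of trip lists stands in for MongoDB, and all names and prices are illustrative assumptions.

```python
from itertools import product

# Graph side (stand-in for Neo4j): station keys and directed edges only.
ADJACENCY = {"BOS": ["NYC", "DC"], "NYC": ["DC"], "DC": []}

# Document side (stand-in for MongoDB): trips per edge id, illustrative prices.
TRIPS = {
    "BOS-NYC": [{"carrier": "Megabus", "price": 24.0}],
    "NYC-DC": [{"carrier": "Bolt", "price": 25.0}],
    "BOS-DC": [{"carrier": "Amtrak", "price": 101.0}],
}


def find_paths(origin, dest, max_hops=3):
    """Depth-first enumeration of simple paths with at most max_hops edges."""
    paths, stack = [], [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == dest:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue
        for nxt in ADJACENCY.get(path[-1], []):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def search(origin, dest):
    """Join paths with their segments and sort itineraries by total price."""
    itineraries = []
    for path in find_paths(origin, dest):
        edge_ids = [f"{a}-{b}" for a, b in zip(path, path[1:])]
        options = [TRIPS.get(e, []) for e in edge_ids]
        for combo in product(*options):
            total = sum(trip["price"] for trip in combo)
            itineraries.append({"path": path, "trips": list(combo), "price": total})
    return sorted(itineraries, key=lambda it: it["price"])


results = search("BOS", "DC")
```

With this sample data the two-leg route through NYC sorts ahead of the direct trip because its combined price is lower. The graph side only ever returns a handful of paths, and the document side is queried once per edge on those paths.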
Geo Lessons
Art: MC Escher
Our Story
• Wanted to mix public transport data with
intercity data
• Did not want to host all public transport
data
• Created a hybrid solution
Our Solution
• Hybrid:
• Google Autocomplete
• Google Maps
• In house station geo lookup
Geo
• Neo4j geo functionality was not available out of the box
• Requires jar install
• Run a Java program to index
• Needed better documentation
• Ended up using MongoDB geo instead
• Request: make geo functionality work out of the box
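A station geo lookup amounts to a nearest-neighbor query over station coordinates. The talk used MongoDB's geo support for this; the pure-Python sketch below only illustrates the intent of such a query with a haversine distance. The coordinates are approximate and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate coordinates (lat, lon), for illustration only.
STATIONS = {
    "BOS": (42.352, -71.055),   # Boston South Station
    "NYC": (40.750, -73.994),   # Midtown Manhattan
    "DC": (38.898, -77.006),    # Washington Union Station
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_station(point):
    """Return the id of the station closest to the given (lat, lon) point."""
    return min(STATIONS, key=lambda sid: haversine_km(point, STATIONS[sid]))


# Cambridge, MA (approximate) should resolve to the Boston station.
nearest = nearest_station((42.373, -71.110))
```

A linear scan like this is fine for a few hundred stations; a database geo index does the same job without loading every station into memory.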
Conclusions
• Even with a join across dbs -> solution
better than relational
• 10s paths x 100s segments vs. 500k x 500k
• Glad to have picked Neo4j: doing content
gen and more geo features now
• Graph model will be useful for future
analytics->Big Data
Useful Links
• Neo4j Internals
slideshare.net/thobe/an-overview-of-neo4j-internals
• Aseem’s Lessons Learned with Neo4j
http://aseemk.com/talks/neo4j-lessons-learned#/14
• Wes Freeman, Neo4j Internals
http://wes.skeweredrook.com/graphdb-meetup-may-2013.pdf
• MongoConnector
blog.mongodb.org/post/29127828146/introducing-mongo-connector
