Airline Reservations and Routing: A Graph Use Case
Jason Plurad
Chin Huang
DataWorks Summit Berlin / April 18, 2018 / © 2018 IBM Corporation
Pilots
Jason Plurad is a software developer in IBM Digital Business Group. He
develops open source software and builds open communities in the big data
and analytics space, with a current focus on graph databases and graph
analytics. He is a Technical Steering Committee member and committer on
JanusGraph and Apache TinkerPop.
Chin Huang is a software engineer at IBM Open Technologies and
Performance. He has worked on various enterprise and open source
projects. His current focus is JanusGraph and Node.js development and
performance characterization.
How Did We Get Here?
Jason
• Raleigh (RDU)
• Detroit (DTW)
• Amsterdam (AMS)
• Berlin (TXL)
Chin
• San Francisco (SFO)
• Copenhagen (CPH)
• Berlin (TXL)
Graphs are not new
Graph Data Use Cases
Social network analysis
Configuration management database
Master data management
Recommendation engines
Knowledge graphs
Internet of things
Cyber security attack analysis
Property Graph
[Figure: property graph with airport vertices RDU, DTW, AMS, TXL, SFO, CPH]
Type: vertex
Label: airport
Name: Berlin Tegel
Code: TXL
City: Berlin
Country: Germany
Type: edge
Label: route
Flight: 343
Distance: 501
Depart: 13:05
Arrive: 14:57
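
The vertex and edge above can be sketched as plain data structures. This is a minimal illustration in Python, not how JanusGraph stores data: the class names `Vertex` and `Edge` and the helper `describe` are hypothetical, and the slide does not say which airports the route connects, so AMS as the origin is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """A labeled vertex carrying arbitrary key/value properties."""
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A labeled, directed edge from out_v to in_v with its own properties."""
    label: str
    out_v: Vertex
    in_v: Vertex
    properties: Dict[str, Any] = field(default_factory=dict)


# Vertex properties taken from the slide.
txl = Vertex("airport", {"name": "Berlin Tegel", "code": "TXL",
                         "city": "Berlin", "country": "Germany"})
# Assumption: the slide does not name the origin airport of the edge.
ams = Vertex("airport", {"code": "AMS"})

# Edge properties taken from the slide.
route = Edge("route", ams, txl, {"flight": 343, "distance": 501,
                                 "depart": "13:05", "arrive": "14:57"})


def describe(edge: Edge) -> str:
    """Render an edge as 'ORIGIN -[label flight]-> DESTINATION'."""
    return (f"{edge.out_v.properties['code']} "
            f"-[{edge.label} {edge.properties['flight']}]-> "
            f"{edge.in_v.properties['code']}")


print(describe(route))
```

The point of the model is that both vertices and edges carry a label plus properties, which is what lets a traversal filter on either.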
Gremlin: Graph Traversal Language
What is the shortest path to Berlin?
Apache TinkerPop
https://tinkerpop.apache.org
> g.V(rdu).
    repeat( out('route').simplePath() ).
    until( has('code', 'TXL') ).
    limit(5).
    path().by('code').
    toList()
==> [RDU, JFK, TXL]
==> [RDU, LAX, TXL]
==> [RDU, MIA, TXL]
==> [RDU, YYZ, TXL]
==> [RDU, SFO, TXL]
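
To make the traversal's logic concrete, here is a rough Python sketch of the same idea: a breadth-first expansion that never revisits an airport within a path (the role of `simplePath()`), stops extending a path once it reaches the target (the role of `until(...)`), and takes the first five results (the role of `limit(5)`). The `ROUTES` table is toy data assumed for illustration, and `simple_paths` is a hypothetical helper, not part of TinkerPop.

```python
from collections import deque
from itertools import islice
from typing import Dict, Iterator, List

# Toy adjacency list (assumed data, not the real route network).
ROUTES: Dict[str, List[str]] = {
    "RDU": ["JFK", "LAX", "MIA", "YYZ", "SFO", "DTW"],
    "JFK": ["TXL"],
    "LAX": ["TXL"],
    "MIA": ["TXL"],
    "YYZ": ["TXL"],
    "SFO": ["TXL"],
    "DTW": ["AMS"],
    "AMS": ["TXL"],
}


def simple_paths(routes: Dict[str, List[str]], start: str,
                 goal: str) -> Iterator[List[str]]:
    """Yield cycle-free paths from start to goal, shortest first."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in routes.get(path[-1], []):
            if nxt in path:          # skip airports already on this path
                continue
            new_path = path + [nxt]
            if nxt == goal:          # stop extending once the goal is reached
                yield new_path
            else:
                queue.append(new_path)


# Take only the first five paths, as limit(5) does in the query.
first_five = list(islice(simple_paths(ROUTES, "RDU", "TXL"), 5))
for p in first_five:
    print(p)
```

Because the queue is processed in breadth-first order, shorter paths are produced before longer ones, which is why limiting the result to the first few paths answers the "shortest path" question on this toy data.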
JanusGraph
Maintainer: The Linux Foundation
License: Apache
Releases: 0.3.0 planned 2Q 2018
https://janusgraph.org
• Established in January 2017
• Fork of TitanDB
• Scalable graph database distributed on multi-machine clusters with pluggable storage and indexing
• Vendor-neutral, open community with open governance
• Founders: Expero, Google, Grakn, Hortonworks, IBM
• Members: Amazon, Huawei, Netflix, Orchestral Developments, Seeq, Uber
• In Production: Celum, Finc, G-Data, IBM Cloud, Seeq
JanusGraph Architecture
http://docs.janusgraph.org/latest/arch-overview.html
Graph database storage backends: Performance evaluation
Graph use case: Air travel reservation
Performance Test Environment
Server spec
• Physical servers: x3650 M5, 2 sockets x 14 cores, 384 GB (12 x 32 GB) memory
• CPU: Intel Xeon Processor E5-2690 v4 14C 2.6GHz 35MB Cache 2400MHz
• Network interface: Emulex VFA5.2 ML2 Dual Port 10GbE SFP+ Adapter
• Disk: 720 GB SSD, RAID 5
• Operating system: Ubuntu 16.04.2 LTS
Public tools
• JMeter - load testing tool
• nmon, nmon analyser - system performance monitoring and analysis tools
• VisualVM - all-in-one Java troubleshooting/profiling tool
• GCeasy - garbage collection log analysis tool
• Prometheus and Grafana - monitoring dashboard
JanusGraph Utility Tools
How about graph data in volume?
• Existing data is often lacking or unavailable for performance evaluation
• What are the performance characteristics at various volumes?
• Graph Data Generator generates graph data in different sizes and shapes, so you can easily simulate real data and performance
How to manage graph schema?
• Graph schema management tools are lacking
• Graph schemas may change for optimal performance
• Graph Schema Loader enables you to quickly load and update schema definitions in JanusGraph
How to massively load data into a graph database?
• Many RDBMSs support exporting data to CSV files
• I have millions/billions of records!
• Data Batch Importer allows you to fully utilize system resources to import data in CSV files into JanusGraph
Open source code: https://github.com/IBM/janusgraph-utils
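
The generate-then-import workflow described above can be illustrated with a small Python sketch that writes synthetic vertices and edges as CSV text of a chosen size. This is not the janusgraph-utils tooling or its file format: the function `generate_graph_csv` and the column names (`node_id`, `code`, `left`, `right`, `flight`) are assumptions made up for this example.

```python
import csv
import io
import random
from typing import Tuple


def generate_graph_csv(num_vertices: int, num_edges: int,
                       seed: int = 42) -> Tuple[str, str]:
    """Return (vertices_csv, edges_csv) for a random directed graph.

    Requires num_vertices >= 2 so every edge can join two distinct vertices.
    """
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    vertex_buf, edge_buf = io.StringIO(), io.StringIO()

    vertex_writer = csv.writer(vertex_buf)
    vertex_writer.writerow(["node_id", "code"])
    for i in range(num_vertices):
        vertex_writer.writerow([i, f"A{i:04d}"])

    edge_writer = csv.writer(edge_buf)
    edge_writer.writerow(["left", "right", "flight"])
    for _ in range(num_edges):
        # Pick two different vertices so no edge is a self-loop.
        left, right = rng.sample(range(num_vertices), 2)
        edge_writer.writerow([left, right, rng.randint(1, 9999)])

    return vertex_buf.getvalue(), edge_buf.getvalue()


vertices_csv, edges_csv = generate_graph_csv(num_vertices=5, num_edges=8)
print(vertices_csv)
print(edges_csv)
```

Scaling `num_vertices` and `num_edges` up is what lets the same schema be exercised at different data volumes before real data is available.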
Performance Test Topology
[Figure: a load injector sends query, insert, and update requests to JanusGraph, which is backed by a database cluster running Cassandra, HBase + HDFS + ZooKeeper, or Scylla; each backend appears three times in the diagram]
Performance Evaluation: Insert Vertices
• 40 million vertices in total
• 2 properties for each vertex
• Insert scenario
• Fully utilize the injectors to generate load against the databases
Performance Evaluation: Insert Edges
• 30 million edges in total
• 1 property for each edge
• Query and update scenario
Performance Evaluation: Graph Traversal
Lessons Learned: Storage Backends
Cassandra
• Cluster bootstrapping takes more effort
• Smaller memory footprint
HBase
• Uneven CPU utilization caused by hot regions
• Need to carefully configure read and write cache settings for better throughput
Scylla
• Easy clustering - adding multiple nodes at once
• Self-tunes well, but documentation is lacking
• Evenly distributed load
• Fully utilizes system resources
• CPU utilization misrepresents real loads
• Nice monitoring dashboard - Prometheus + Grafana
• Works with existing Cassandra utility clients
Flight Search Use Case
Flight search
• All flights from airport A to airport B on a given date and time
• Number of stops: non-stop, one-stop, two-stop…
Data spec
• 600+ airports, 350K+ flight schedules
Graph Model
Vertex: Airport
Airport code
Vertex: Country
Country name
Edge: Flight Schedule
Flight #
Departure date
Arrival date
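
As a rough illustration of the flight search over this model, the Python sketch below treats each flight schedule as an edge between airports and finds itineraries with at most a given number of stops, requiring each connecting flight to depart no earlier than the previous one arrives. The flight numbers, times, and the `Flight` and `search` names are made up for this example, time zones are ignored, and the sketch says nothing about how the authors implemented the search in JanusGraph.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Set


@dataclass(frozen=True)
class Flight:
    """One flight-schedule edge between two airport vertices."""
    number: str
    origin: str
    dest: str
    depart: datetime
    arrive: datetime


def search(flights: List[Flight], origin: str, dest: str,
           earliest: datetime, max_stops: int) -> List[List[Flight]]:
    """Return itineraries from origin to dest with at most max_stops stops."""
    by_origin: Dict[str, List[Flight]] = {}
    for f in flights:
        by_origin.setdefault(f.origin, []).append(f)

    results: List[List[Flight]] = []

    def extend(airport: str, ready: datetime,
               legs: List[Flight], visited: Set[str]) -> None:
        for f in by_origin.get(airport, []):
            # Skip flights that leave too early or revisit an airport.
            if f.depart < ready or f.dest in visited:
                continue
            itinerary = legs + [f]
            if f.dest == dest:
                results.append(itinerary)
            elif len(itinerary) <= max_stops:
                extend(f.dest, f.arrive, itinerary, visited | {f.dest})

    extend(origin, earliest, [], {origin})
    return results


day = datetime(2018, 4, 17)
FLIGHTS = [  # made-up schedule data
    Flight("X100", "SFO", "TXL", day.replace(hour=8), day.replace(hour=19)),
    Flight("X200", "SFO", "CPH", day.replace(hour=9), day.replace(hour=18)),
    Flight("X201", "CPH", "TXL", day.replace(hour=19, minute=30),
           day.replace(hour=20, minute=30)),
    Flight("X202", "CPH", "TXL", day.replace(hour=17), day.replace(hour=18)),
]

for itinerary in search(FLIGHTS, "SFO", "TXL", day, max_stops=1):
    print([f.number for f in itinerary])
```

Here X202 is rejected as a connection because it departs CPH before X200 arrives, which is the kind of date and time constraint the flight search has to apply at every hop.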
Lessons Learned: Flight Search
Model your graph database for performance
• Design the data model for your use cases!
• Understand the workload's read/write ratio
• What kinds of queries do you want to support? How many levels deep does a traversal go?
• Consider denormalization…
• Design and use the various indexes supported in JanusGraph
Try different approaches to get results back faster
• Use a pre-processor in a custom app
• Use Gremlin queries, applying filters as early as possible in a query to limit the number of traversals
• Use Groovy methods as a programmable extension
Fine-tune for your workloads and systems
• JanusGraph supports pluggable storage and index backends, so tune your backends!
• JanusGraph server configurations, such as threadPoolBoss and threadPoolWorker
• JVM configurations, such as Xms (initial and minimum Java heap size) and Xmx (maximum Java heap size). You don’t want to see annoying java.lang.OutOfMemoryError exceptions or long, slow GCs.
• Use multiple threads and/or instances, up to your system’s capacity
• Consider cloud and auto-scaling
• Be thorough and be patient, because it will take a few iterations!
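
The advice to apply filters as early as possible can be illustrated with a toy Python comparison. Both functions below find one-stop connections on a given day and return the same answer, but the early-filter version examines fewer edges because it discards non-matching first legs before expanding the second hop. The graph data and function names are assumptions for illustration, and the edge counter is only a stand-in for the work a graph database would do.

```python
from typing import Dict, List, Tuple

# origin -> list of (destination, day of departure); made-up data.
GRAPH: Dict[str, List[Tuple[str, int]]] = {
    "SFO": [("CPH", 1), ("AMS", 2), ("JFK", 2)],
    "CPH": [("TXL", 1), ("TXL", 2)],
    "AMS": [("TXL", 1), ("TXL", 2)],
    "JFK": [("TXL", 2), ("TXL", 3)],
}


def two_hop_late(graph, start: str, day: int):
    """Expand both hops fully, then filter the finished paths by day."""
    examined = 0
    paths = []
    for d1, day1 in graph.get(start, []):
        examined += 1
        for d2, day2 in graph.get(d1, []):
            examined += 1
            paths.append(((d1, day1), (d2, day2)))
    kept = [(a[0], b[0]) for a, b in paths if a[1] == day and b[1] == day]
    return kept, examined


def two_hop_early(graph, start: str, day: int):
    """Filter the first hop by day before expanding the second hop."""
    examined = 0
    kept = []
    for d1, day1 in graph.get(start, []):
        examined += 1
        if day1 != day:
            continue  # prune here, so this branch is never expanded
        for d2, day2 in graph.get(d1, []):
            examined += 1
            if day2 == day:
                kept.append((d1, d2))
    return kept, examined


late_result, late_cost = two_hop_late(GRAPH, "SFO", day=1)
early_result, early_cost = two_hop_early(GRAPH, "SFO", day=1)
print(late_result, late_cost)
print(early_result, early_cost)
```

The gap between the two counts grows with each extra hop, since every pruned first leg removes an entire subtree of later expansions.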
Thank you
compose.com/databases/janusgraph
twitter.com/pluradj
twitter.com/chinhuang007
github.com/IBM/janusgraph-utils
developer.ibm.com/code/patterns