Titan and Cassandra at WellAware
Ted Wilmes
tedwilmes@wellaware.us
Topics
● The property graph model
● The graph ecosystem
● Titan overview
● Titan at WellAware
Property graph model
● Vertices
○ label: truck, license: ABC123, year: 2013
○ label: person, firstName: Susan
○ label: company, name: Acme Trucks
● Edges
○ owns, bought: 2012
○ employs, hired: 2012
○ drives
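The model on this slide can be sketched in a few lines of plain Python: labeled vertices and labeled, directed edges, each carrying its own key/value properties. The class names are illustrative and are not TinkerPop or Titan API; the edge endpoints are an assumption read off the edge labels.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_vertex: Vertex  # tail: where the edge starts
    in_vertex: Vertex   # head: where the edge points
    properties: dict = field(default_factory=dict)


# Vertices from the slide
truck = Vertex("truck", {"license": "ABC123", "year": 2013})
susan = Vertex("person", {"firstName": "Susan"})
acme = Vertex("company", {"name": "Acme Trucks"})

# Edges from the slide; the endpoints are assumed from the edge labels
edges = [
    Edge("owns", acme, truck, {"bought": 2012}),
    Edge("employs", acme, susan, {"hired": 2012}),
    Edge("drives", susan, truck),
]

for e in edges:
    print(e.out_vertex.label, f"-{e.label}->", e.in_vertex.label, e.properties)
```

Note that edges hold properties just as vertices do, which is what separates a property graph from a plain labeled graph.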
Sampling of graph projects/vendors
Apache TinkerPop - tying it all together!
● Gremlin Server
○ Remote access to TinkerPop-compliant graph dbs for JVM & non-JVM clients
● Gremlin
○ Graph query and processing language
● Core API
○ add vertex
○ add edge
○ add/update properties
○ simple queries (adjacent edge/vertex retrieval)
http://tinkerpop.incubator.apache.org/
Gremlin in action
● Add vertices and edges
● Retrieving vertices
● Basic vertex filtering
● Querying adjacent edges and vertices
Example graph
● Vertices
○ vehicle - license: ABC123, year: 2013
○ person - firstName: Susan
○ person - firstName: Tom
○ company - name: Acme Trucks
● Edges
○ employs - hired: 2012
○ employs - hired: 2014
○ owns - bought: 2012
○ drives
Building the graph
// Open the graph using the Cassandra-backed configuration
graph = TitanFactory.open('conf/titan-cassandra.properties')
// Add vertices
acmeTrucks = graph.addVertex(T.label, "company", "name", "Acme Trucks")
susan = graph.addVertex(T.label, "person", "firstName", "Susan")
tom = graph.addVertex(T.label, "person", "firstName", "Tom")
truck = graph.addVertex(T.label, "vehicle", "license", "ABC123", "year", 2012)
// Connect vertices with edges
edge = acmeTrucks.addEdge("owns", truck)
edge.property("bought", 2012)
acmeTrucks.addEdge("employs", susan).property("hired", 2012)
acmeTrucks.addEdge("employs", tom).property("hired", 2014)
tom.addEdge("drives", truck)
// Commit the transaction so that the changes are persisted
graph.tx().commit()
Retrieving vertices
// Get a traversal source so that we can run some queries
g = graph.traversal(standard())
gremlin> g.V()
==>v[0]
==>v[2]
==>v[4]
==>v[6]
// Get the properties for each vertex
gremlin> g.V().valueMap()
==>[name:[Acme Trucks]]
==>[firstName:[Susan]]
==>[firstName:[Tom]]
==>[license:[ABC123], year:[2012]]
Basic vertex filtering
// Retrieve all people with firstName Susan
gremlin> g.V().hasLabel("person").has("firstName", "Susan")
==>v[2]
// Retrieve all people with firstName Susan or Tom
gremlin> g.V().hasLabel("person").has("firstName", within("Susan", "Tom"))
==>v[2]
==>v[4]
Querying adjacent edges and vertices
// Count how many people Acme Trucks employs
gremlin> g.V().hasLabel("company").has("name", "Acme Trucks").out("employs").count()
==>2
// How many employees were hired in 2012?
gremlin> g.V().hasLabel("person").where(inE("employs").has("hired", 2012)).count()
==>1
// Which employees drive a truck?
gremlin> g.V().hasLabel("company").has("name", "Acme Trucks").out("employs").as("driver").out("drives").select("driver").values("firstName")
==>Tom
// Show me all of the drivers that were hired before 2015
gremlin> g.V().hasLabel("person").and(inE("employs").values("hired").is(lt(2015)), out("drives")).values("firstName")
==>Tom
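To make the four traversals above concrete, here is a minimal sketch that answers the same questions over a plain-Python copy of the example graph. It is not Titan or TinkerPop code; the helper names `out` and `in_edges` are hypothetical stand-ins for the Gremlin steps of similar names.

```python
# Plain-Python stand-in for the example graph (not Titan or TinkerPop code).
vertices = {
    0: {"label": "company", "name": "Acme Trucks"},
    2: {"label": "person", "firstName": "Susan"},
    4: {"label": "person", "firstName": "Tom"},
    6: {"label": "vehicle", "license": "ABC123", "year": 2012},
}
# Each edge: (out vertex id, edge label, in vertex id, edge properties)
edges = [
    (0, "owns", 6, {"bought": 2012}),
    (0, "employs", 2, {"hired": 2012}),
    (0, "employs", 4, {"hired": 2014}),
    (4, "drives", 6, {}),
]


def out(vid, label):
    """Ids of vertices reached by following outgoing `label` edges."""
    return [dst for src, lbl, dst, _ in edges if src == vid and lbl == label]


def in_edges(vid, label):
    """Property maps of the incoming `label` edges of a vertex."""
    return [props for _, lbl, dst, props in edges if dst == vid and lbl == label]


acme = next(v for v, p in vertices.items() if p.get("name") == "Acme Trucks")
people = [v for v, p in vertices.items() if p["label"] == "person"]

# Count how many people Acme Trucks employs
employee_count = len(out(acme, "employs"))
# How many employees were hired in 2012?
hired_2012 = sum(
    1 for p in people if any(e["hired"] == 2012 for e in in_edges(p, "employs"))
)
# Which employees drive a truck?
drivers = [vertices[p]["firstName"] for p in out(acme, "employs") if out(p, "drives")]
# Drivers that were hired before 2015
drivers_before_2015 = [
    vertices[p]["firstName"]
    for p in people
    if any(e["hired"] < 2015 for e in in_edges(p, "employs")) and out(p, "drives")
]

print(employee_count, hired_2012, drivers, drivers_before_2015)
```

Each list comprehension here scans every edge; a graph database avoids that by storing each vertex's edges together, which is the subject of the storage and indexing slides later in the deck.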
Many more steps...
● AddEdge Step
● AddVertex Step
● AddProperty Step
● Aggregate Step
● And Step
● As Step
● By Step
● Cap Step
● Coalesce Step
● Count Step
● Choose Step
● Coin Step
● CyclicPath Step
● Dedup Step
● Drop Step
● Fold Step
● Group Step
● GroupCount Step
● Has Step
● Inject Step
● Is Step
● Limit Step
● Local Step
● Match Step
● ...
GraphComputer for global graph processing
● Use cases
○ full graph traversal
○ parallel processing
○ batch import/export
● Examples
○ PageRank
○ vertex count
○ mass schema update
● Gremlin OLAP implementations
○ Hadoop
○ Spark
○ Giraph
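As an illustration of the kind of whole-graph computation listed above, here is a minimal single-process PageRank sketch in plain Python. It does not use the GraphComputer API, which would distribute this sort of per-vertex work; the toy graph and the function name are assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """Power-iteration PageRank over a dict of vertex -> list of out-neighbours."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Every vertex starts each round with the "teleport" share.
        nxt = {v: (1.0 - damping) / n for v in out_links}
        for v, targets in out_links.items():
            if targets:
                # Split this vertex's rank evenly over its outgoing edges.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for t in out_links:
                    nxt[t] += damping * rank[v] / n
        rank = nxt
    return rank


# Hypothetical toy graph reusing the names from the earlier example.
graph = {
    "acme": ["susan", "tom", "truck"],
    "susan": [],
    "tom": ["truck"],
    "truck": [],
}
ranks = pagerank(graph)
for vertex, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{vertex}: {score:.3f}")
```

Every iteration touches every vertex and edge, which is why this class of work is run as a batch (OLAP) job rather than as an interactive traversal.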
Graph use cases
● Social network analysis
● Fraud detection
● Recommendation systems
● Route optimization
● IoT
● Master data management
TitanDB
● What is Titan?
● Data store options
● Deployment options
● Titan Cassandra data model
● Titan specific graph features
TitanDB
● Graph layer that can use a variety of data stores as backends depending on user requirements
○ HBase
○ Berkeley DB
○ Cassandra
○ Insert your favorite k/v, BigTable data store
Which data store is right for you?
● Things to think about
○ data volume
○ CAP
○ ACID
○ read/write requirements
○ ops implications
○ your current infrastructure
http://s3.thinkaurelius.com/docs/titan/0.5.4/benefits.html
[Diagram labels: Socket, JVM, Node, Embedded]
A Titan cluster with access options
[Diagram: five nodes, each labeled Titan and C*]
● Access options
○ Titan < 0.9
■ Rexster
■ dependency of your app
○ Titan 0.9+
■ Gremlin server
■ dependency of your app
○ Object to graph mapper
■ Python - Mogwai, Bulbs
■ JVM - Totorom, Frames
● Titan does not need to be on each node; all communication between Titan instances is through C*
Titan installation
● Download and unzip latest milestone
● Cassandra footprint
○ Titan keyspace
○ Column families
■ edgestore
■ edgestore_lock_
■ graphindex
■ graphindex_lock_
■ titan_ids
■ ...
./bin/titan.sh start
Forking Cassandra...
Running `nodetool statusthrift`.. OK (returned exit status 0 and printed string "running").
Forking Elasticsearch...
Connecting to Elasticsearch (127.0.0.1:9300). OK (connected to 127.0.0.1:9300).
Forking Gremlin-Server...
Connecting to Gremlin-Server (127.0.0.1:8182)...... OK (connected to 127.0.0.1:8182).
Run gremlin.sh to connect.
Vertex and edge storage format
[Diagram labels: Cassandra, Thrift, Titan storage format]
Edge and property serialization
Schema definition
● Properties
○ data type - string, float, char, geoshape, etc.
○ cardinality - single, list, set
○ uniqueness (through Titan’s indexing system)
● Edges
○ labels
○ define multiplicity - one-to-one, many-to-one, one-to-many
● Vertices
○ labels
● Advanced
○ edge, vertex, and property TTL
○ Meta-properties - properties on properties (audit info, for example)
Global indexing options
● Supports composite keys
● Titan indexing provider
○ fast!
○ exact matches only
● External providers
○ Not as fast
○ Many options beyond exact matching (wildcards, geosearch, etc.)
○ providers
■ Elasticsearch
■ Lucene
■ Solr
I want that one!
Vertex Centric Indices
● Adjacent edge counts can grow quite large in certain situations and form supernodes
● Supports composite keys and ordering of edges to speed up vertex centric queries
○ translates into slice queries of the edges
○ efficiently retrieve ranges of edges or satisfy top n type queries
[Diagram: company (name: Acme Trucks) with three employs edges, hired: 2013, 2014, 2015]
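The slice-query idea can be sketched in plain Python: if a vertex's edges are kept sorted by the indexed key, a range or top n query becomes a contiguous slice found by binary search instead of a scan over every edge. This shows the principle only and is not Titan's implementation; the edge data and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical "employs" edges on one company vertex: (hired year, employee).
# Keeping them sorted by the indexed key stands in for a vertex-centric index.
employs_edges = sorted([
    (2015, "employee-c"),
    (2013, "employee-a"),
    (2014, "employee-b"),
])
hired_years = [year for year, _ in employs_edges]


def hired_between(first_year, last_year):
    """Return the contiguous slice of edges with first_year <= hired <= last_year."""
    lo = bisect_left(hired_years, first_year)
    hi = bisect_right(hired_years, last_year)
    return employs_edges[lo:hi]


def most_recent(n):
    """Top n query: the n most recent hires, read from the end of the sort order."""
    return employs_edges[-n:][::-1] if n > 0 else []


print(hired_between(2014, 2015))
print(most_recent(1))
```

With millions of edges on a supernode, the binary search cost grows logarithmically while a full scan grows linearly, which is the reason to define the sort key up front.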
Graph partitioning with ByteOrderedPartitioner
Vertex cuts
[Diagram: a supernode with edges numbered 1 ... 2,000,000]
mgmt = graph.openManagement()
mgmt.makeVertexLabel('user').partition().make()
mgmt.commit()
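A rough way to picture a vertex cut: rather than keeping all of a supernode's edges in one place, its adjacency list is spread over several partitions. The sketch below uses a simple hash of the neighbour id to pick a partition. That placement rule, the partition count, and the names are assumptions for illustration, not Titan's actual algorithm.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical number of partitions


def partition_for(neighbour_id: str) -> int:
    """Pick a partition from the neighbour id (illustrative hashing only)."""
    return crc32(neighbour_id.encode()) % NUM_PARTITIONS


# A supernode with many incident edges, e.g. a very popular 'user' vertex.
followers = [f"user-{i}" for i in range(10_000)]

# Vertex cut: the supernode's adjacency list is split across partitions
# instead of living in one huge row on a single machine.
adjacency_by_partition = defaultdict(list)
for follower in followers:
    adjacency_by_partition[partition_for(follower)].append(follower)

for partition in sorted(adjacency_by_partition):
    print(partition, len(adjacency_by_partition[partition]))
```

The trade-off is that reading all of the supernode's edges now touches every partition, in exchange for no single partition holding the whole list.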
Edge cuts
[Diagram labels: Company A, Company B]
A bit more about WellAware
● Founded in 2012
● Full stack oil & gas monitoring solution
● iOS, Android, and web clients
● Connecting to field assets over RPMA, cellular, and satellite
Functionality and high level architecture
● Remote data collection
● Mobile data collection
● Asset control
● Derived measurements
● Alarming
● Reporting
[Architecture diagram labels: Poller, Django, Titan, WAN, ESB]
Moving to Titan
● 2013
○ Running Django against PostgreSQL and, for a while, TempoDB
● Beginning of 2014 - started using Titan 0.4.4 to capture relationships between assets and for derived measurements
● March 2014 - deployed a 3 node Cassandra cluster and moved the rest of the backend (minus auth) over to Titan 0.4.4
● Today - 3 node DC for OLTP & 2 node reporting DC
○ still on Titan 0.4.4, waiting for Titan 1.0 to be released and hardened
○ post Titan 1.0, we’re looking forward to trying out DSE Graph
A common well pad configuration
Well & pumpjack
Tanks
Sample of model
[Graph diagram; labels as extracted: O&G Co., TankSite, Top Gauge]
Zooming in on a well pad
[Diagram labels: well, meter, separator, meter, tank, tank, compressor]
Lessons learned
● No native integration with 3rd party BI tools - reports, dashboards, ad hoc query
○ Apache Calcite based JDBC driver that translates SQL to graph queries
● Colocation of Titan, some of your application code, and Cassandra on the same nodes: what’s the right separation?
● Out of the box framework support is lacking (no native Spring, Dropwizard support)
● Performance tuning requires knowledge of Titan AND Cassandra
● Play to Cassandra and adjacency list storage format strengths
● You can’t hide from tombstones!!!
Graph and Titan resources
● TinkerPop docs - http://www.tinkerpop.com/docs/3.0.0.M6/
● Titan docs - http://s3.thinkaurelius.com/docs/titan/0.9.0-M2/
● Titan Google group - https://groups.google.com/forum/#!forum/aureliusgraphs
● Gremlin Google group - https://groups.google.com/forum/#!forum/gremlin-users
● O’Reilly graph ebook (focuses on Neo4j but has generally applicable graph info) - http://graphdatabases.com/
● Java OGM - https://github.com/BrynCooke/totorom
● Python OGM - https://mogwai.readthedocs.org/en/latest/
Thanks and what questions do you have?
