GRAPH DATABASES: THE
SOLUTION FOR STORING
SEMI-STRUCTURED BIG DATA
Mohamed
Taher
Alrefaie
DATA IS GETTING BIGGER
“Every two days, we create as much information as we did up to 2003.”
Eric Schmidt, former Google CEO, 2010.
DATA IS MORE CONNECTED
A look at the following examples shows it:
- Facebook Graph
- LinkedIn Graph
- Linked Data
- Blogs/Tagging
DATA IS LESS STRUCTURED
Modelling FB
Graph?
Persons,
friendships,
photos, locations,
apps, pages, ads,
interests, age
range, etc.
NOSQL DATABASES
Four types of
databases that
alleviate the
performance
issues of
relational
databases
KEY VALUE STORES
Data Model:
- Global key-value mapping
- Big scalable HashMap
- Highly fault tolerant (typically)
Examples:
- Redis, Riak, Voldemort, Dynamo
KEY VALUE STORES: PROS AND
CONS
Pros:
Simple data model
Scalable
Cons
Create your own “foreign keys”
Poor for complex data
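The "create your own foreign keys" point can be sketched with a plain in-memory map standing in for the store. The key scheme ("user:<id>:name", "user:<id>:friends") and the class name are assumptions made for illustration, not the convention of any particular product. The store only maps keys to values, so the application has to resolve every reference itself and cope with references that point nowhere.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyValueForeignKeys {
    // Hypothetical key scheme: "user:<id>:name" and "user:<id>:friends".
    static Map<String, String> sampleStore() {
        Map<String, String> store = new HashMap<>();
        store.put("user:1:name", "marko");
        store.put("user:2:name", "vadas");
        store.put("user:4:name", "josh");
        store.put("user:1:friends", "2,4,9"); // id 9 was never stored: a dangling reference
        return store;
    }

    // The store knows nothing about references, so the application resolves
    // them by hand: one read for the id list, then one more read per friend.
    static List<String> friendNames(Map<String, String> store, String userId) {
        List<String> names = new ArrayList<>();
        String ids = store.getOrDefault("user:" + userId + ":friends", "");
        for (String id : ids.split(",")) {
            String name = store.get("user:" + id + ":name");
            if (name != null) {
                names.add(name); // missing targets are skipped; nothing enforces integrity
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(friendNames(sampleStore(), "1")); // prints [vadas, josh]
    }
}
```

Each extra level of relationship (friends of friends, and so on) adds another round of lookups written by hand, which is why the model is a poor fit for complex data.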
COLUMN FAMILY
Main idea is based on BigTable: Google’s
distributed storage model for Structured Data
Data Model:
A big table, with column families
Map Reduce for querying/processing
Examples:
- HBase, HyperTable, Cassandra
COLUMN FAMILY: PROS AND CONS
Pros:
Supports Semi-Structured Data
Naturally Indexed (columns)
Scalable
Cons
Poor for interconnected data
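The column family data model can be pictured as nested sorted maps: row key, then column family, then column, then value. The sketch below is an in-memory illustration under that assumption; the class and method names are hypothetical and it does not model distribution, versioning or timestamps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class ColumnFamilySketch {
    // row key -> column family -> column -> value; rows are kept sorted by key
    private final NavigableMap<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    void put(String rowKey, String family, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(column, value);
    }

    // Returns null when the row, the family or the column is absent.
    String get(String rowKey, String family, String column) {
        Map<String, Map<String, String>> families =
                rows.getOrDefault(rowKey, Collections.emptyMap());
        return families.getOrDefault(family, Collections.emptyMap()).get(column);
    }

    // Sorted row keys make range scans cheap: [from, to) in key order.
    List<String> rowKeysBetween(String from, String to) {
        return new ArrayList<>(rows.subMap(from, true, to, false).keySet());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        // Semi-structured: the two rows do not share the same set of columns.
        table.put("user#001", "profile", "name", "marko");
        table.put("user#001", "profile", "age", "29");
        table.put("user#002", "profile", "name", "lop");
        table.put("user#002", "project", "lang", "java");
        System.out.println(table.get("user#002", "project", "lang")); // prints java
        System.out.println(table.get("user#001", "project", "lang")); // prints null
    }
}
```

Nothing in this structure links one row to another, which is the reason the model is listed as poor for interconnected data.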
DOCUMENT DATABASES
Data Model:
A collection of documents
A document is a key value collection
Index-centric, uses map-reduce extensively
Examples:
- CouchDB, MongoDB
DOCUMENT DATABASES: PROS AND
CONS
Pros:
Simple, powerful data model
Scalable
Cons
Poor for interconnected data
Query model limited to keys and indexes
Map reduce for larger queries
GRAPH DATABASES
Data Model:
Nodes and Relationships
Examples:
- Titan, Neo4j, OrientDB, etc.
GRAPH DATABASES: PROS AND
CONS
Pros:
Powerful data model, as general as RDBMS
Connected data locally indexed
Easy to query
Cons
Sharding
Requires different data modelling
RDBMS LIVING IN A NOSQL WORLD
[Figure: database families plotted by data size against data complexity. Labels: Key-Value Store, BigTable Clones, Document Databases, Graph Databases, Relational Databases. Annotations: "90% of Use Cases", "9,223,372,036,854,775,807".]
WHAT IS A GRAPH?
An abstract representation of a set of objects where
some pairs are connected by links.
Object (Vertex, Node)
Link (Edge, Arc,
Relationship)
WHAT IS A GRAPH DATABASE?
A database with an explicit graph structure
Each node knows its adjacent nodes through edges
As the number of nodes increases, the cost of a local
step (or hop) remains the same
Plus an index for lookups
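The three points above can be sketched in plain Java. This is a minimal in-memory illustration, not how any particular graph database is implemented, and the class and method names are made up for the example. Each node keeps its own list of outgoing edges, so a hop only touches that node's neighbours, and a separate index is used once to find the starting node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdjacencySketch {
    static final class Node {
        final String name;
        final List<Edge> out = new ArrayList<>(); // each node knows its own outgoing edges
        Node(String name) { this.name = name; }
    }

    static final class Edge {
        final String label;
        final Node target;
        Edge(String label, Node target) { this.label = label; this.target = target; }
    }

    // The index: used only to locate a starting node by name.
    private final Map<String, Node> indexByName = new HashMap<>();

    Node addNode(String name) {
        return indexByName.computeIfAbsent(name, Node::new);
    }

    void addEdge(String from, String label, String to) {
        addNode(from).out.add(new Edge(label, addNode(to)));
    }

    // One index lookup, then one hop. The hop costs as much as the start
    // node's degree, however many other nodes the graph contains.
    List<String> neighbours(String start, String label) {
        List<String> result = new ArrayList<>();
        Node node = indexByName.get(start);
        if (node == null) {
            return result;
        }
        for (Edge e : node.out) {
            if (e.label.equals(label)) {
                result.add(e.target.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencySketch graph = new AdjacencySketch();
        graph.addEdge("marko", "knows", "vadas");
        graph.addEdge("marko", "knows", "josh");
        graph.addEdge("marko", "created", "lop");
        System.out.println(graph.neighbours("marko", "knows")); // prints [vadas, josh]
    }
}
```

Adding a million unrelated nodes to this graph would not change the work done by neighbours("marko", "knows"), which is the property the slide describes.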
APACHE TINKERPOP: A UNIFIED API
Working with such complex databases requires a
well-implemented API from the vendor. But a
vendor-specific API ties application code to one
product and makes migrating to another database
very hard.
The solution is provided
by Apache Tinkerpop.
WHAT IS APACHE TINKERPOP?
● A graph processing system
● Currently under Apache incubation (2015)
● Has Tinkerpop3 Structure API
    ● Graph, Element, Property
● Has Tinkerpop3 Process API
    ● TraversalSource, GraphComputer
● Gremlin query language
    ● A scripting language for graph traversal and mutation
● REST API
WHY APACHE TINKERPOP?
Tinkerpop is a generic API for graph databases
Think ODBC, JDBC or Hibernate for relational
databases
Integrates with:
- Titan DB
- Neo4j
- OrientDB
- And many more.
Uses Gremlin graph scripting language
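The JDBC analogy can be illustrated with a small sketch. The GraphStore interface and both implementations below are hypothetical stand-ins written for this example; they are not the Tinkerpop API. The point is only that application logic written against a shared interface keeps working when the backend behind it is swapped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical vendor-neutral contract (NOT the Tinkerpop API).
interface GraphStore {
    void addEdge(String from, String label, String to);
    List<String> out(String from, String label);
}

// "Vendor A": adjacency lists keyed by source vertex.
class AdjacencyListStore implements GraphStore {
    private final Map<String, List<String[]>> outEdges = new HashMap<>();

    public void addEdge(String from, String label, String to) {
        outEdges.computeIfAbsent(from, k -> new ArrayList<>()).add(new String[] {label, to});
    }

    public List<String> out(String from, String label) {
        List<String> result = new ArrayList<>();
        for (String[] edge : outEdges.getOrDefault(from, new ArrayList<>())) {
            if (edge[0].equals(label)) result.add(edge[1]);
        }
        return result;
    }
}

// "Vendor B": one flat edge table that is scanned on every query.
class EdgeTableStore implements GraphStore {
    private final List<String[]> edges = new ArrayList<>();

    public void addEdge(String from, String label, String to) {
        edges.add(new String[] {from, label, to});
    }

    public List<String> out(String from, String label) {
        List<String> result = new ArrayList<>();
        for (String[] edge : edges) {
            if (edge[0].equals(from) && edge[1].equals(label)) result.add(edge[2]);
        }
        return result;
    }
}

class VendorNeutralDemo {
    // Business logic depends on the interface only, never on a concrete store.
    static List<String> friendsOf(GraphStore store, String person) {
        return store.out(person, "knows");
    }

    static GraphStore load(GraphStore store) {
        store.addEdge("marko", "knows", "vadas");
        store.addEdge("marko", "knows", "josh");
        store.addEdge("marko", "created", "lop");
        return store;
    }

    public static void main(String[] args) {
        // Same application code, two different backends, same answer.
        System.out.println(friendsOf(load(new AdjacencyListStore()), "marko")); // prints [vadas, josh]
        System.out.println(friendsOf(load(new EdgeTableStore()), "marko"));     // prints [vadas, josh]
    }
}
```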
TITAN DATABASE
Titan is a scalable graph database using Tinkerpop
APIs optimized for storing and querying graphs
containing hundreds of billions of vertices and edges
distributed across a multi-machine cluster.
Supports Apache Spark and Hadoop (implicitly) for
map-reduce operations.
Integrates with:
- Elasticsearch, Solr, Lucene
Uses as a backend storage:
- Apache Cassandra
- Apache HBase
PUTTING IT ALL TOGETHER
[Architecture diagram, layers from top to bottom:]
Apache Tinkerpop API: Gremlin server, graph traversal, Gremlin client, monitoring
Titan DB
Storage specific (Cassandra, HBase, BerkeleyDB)
TITAN: EXAMPLE
Download the Titan server and console here:
- https://github.com/thinkaurelius/titan/wiki/Downloads
$ cd titan-1.0.0-hadoop1
$ bin/gremlin.sh
gremlin> graph = TitanFactory.open('conf/titan-berkeleyje-es.properties')
gremlin> GraphOfTheGodsFactory.load(graph)
gremlin> g = graph.traversal()
TINKERPOP: EXAMPLE
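import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

// Create an in-memory graph. T.id is the TinkerPop 3 release name for the
// id token that early milestone docs called Element.ID.
Graph graph = TinkerGraph.open();
// Vertices: an explicit id followed by key/value properties.
Vertex marko = graph.addVertex(T.id, 1, "name", "marko", "age", 29);
Vertex vadas = graph.addVertex(T.id, 2, "name", "vadas", "age", 27);
Vertex lop = graph.addVertex(T.id, 3, "name", "lop", "lang", "java");
Vertex josh = graph.addVertex(T.id, 4, "name", "josh", "age", 32);
Vertex ripple = graph.addVertex(T.id, 5, "name", "ripple", "lang", "java");
Vertex peter = graph.addVertex(T.id, 6, "name", "peter", "age", 35);
// Edges: label, target vertex, explicit id, then properties.
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f);
marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f);
marko.addEdge("created", lop, T.id, 9, "weight", 0.4f);
josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f);
josh.addEdge("created", lop, T.id, 11, "weight", 0.4f);
peter.addEdge("created", lop, T.id, 12, "weight", 0.2f);
// Traversal source used for Gremlin queries against this graph.
GraphTraversalSource g = graph.traversal();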
TINKERPOP: EXAMPLE (CONT.)
gremlin> g.V().has('name','marko').out('knows').values('name')
==>vadas
==>josh
SUMMARY
Graph databases are well suited to large volumes of
semi-structured, highly connected data.
Apache Tinkerpop is a generic API for graph databases
that keeps business logic free of vendor-specific code.
Titan DB is a scalable distributed graph database built on
top of other databases. It uses Cassandra, HBase or
BerkeleyDB as its storage backend, which lets the
database scale as far as the chosen backend allows.
REFERENCES
http://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789
http://www.slideshare.net/mikejf12/an-introduction-to-apache-tinkerpop
http://www.tinkerpop.com
http://tinkerpop.incubator.apache.org
http://tinkerpop.incubator.apache.org/docs/3.0.0.M9-incubating/#gremlin-console
http://www.titandb.io
MOHAMED TAHER
ALREFAIE
07/12/2015

Graph databases: Tinkerpop and Titan DB
