Large Scale Graph Analytics with JanusGraph
DataWorks Summit San Jose
June 13, 2017
P. Taylor Goetz, Hortonworks
@ptgoetz
Large Scale Graph Analytics with JanusGraph

Slides from my DataWorks Summit presentation on JanusGraph

Published in: Technology

  1. Large Scale Graph Analytics with JanusGraph
     DataWorks Summit San Jose, June 13, 2017
     P. Taylor Goetz, Hortonworks, @ptgoetz
  2. About Me
     • Tech Staff @ Hortonworks
     • TSC Member, JanusGraph
     • PMC Chair, Apache Storm
     • ASF Member
     • PMC: Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex, Apache Eagle, Apache Metron
  3. What is a Graph Database?
  4. “In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation.”
     – Wikipedia
  5. Graph Structures - Vertices
     • Vertices are the nodes or points in a graph structure
  6. Graph Structures - Vertices
     • Vertices are the nodes or points in a graph structure
     • Vertices can be associated with a set of properties (key-value pairs)
  7. Graph Structures - Edges
     • Edges are the connections between the vertices in a graph
  8. Graph Structures - Edges
     • Edges are the connections between the vertices in a graph
     • Edges can be non-directional, directional, or bi-directional
  9. Graph Structures - Edges
     • Edges are the connections between the vertices in a graph
     • Edges can be non-directional, directional, or bi-directional
     • Edges can be named and, like vertices, can have properties
  10. Graph Structures - Graph
      • The graph is the collection of vertices, edges, and associated properties
      G = (V, E)
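The structure described in slides 5 through 10 can be sketched in a few lines of plain Python. This is only an illustration of a property graph G = (V, E), not how JanusGraph stores data; the class names, field names, and the sample `since` edge property are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int
    # Key-value pairs attached to the vertex.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    label: str   # the edge's name
    out_v: int   # id of the vertex the edge leaves
    in_v: int    # id of the vertex the edge points to
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Graph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)  # V
    edges: List[Edge] = field(default_factory=list)            # E

    def add_vertex(self, vid: int, **props: Any) -> Vertex:
        self.vertices[vid] = Vertex(vid, dict(props))
        return self.vertices[vid]

    def add_edge(self, label: str, out_v: int, in_v: int, **props: Any) -> Edge:
        # An edge may only connect vertices that are already in V.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must already be vertices")
        edge = Edge(label, out_v, in_v, dict(props))
        self.edges.append(edge)
        return edge


graph = Graph()
graph.add_vertex(1, name="alice")   # hypothetical sample data
graph.add_vertex(2, name="bob")
graph.add_edge("knows", 1, 2, since=2015)
```

A directed edge is modeled here as an (out, in) pair; a bi-directional relationship could be represented as two edges, one in each direction.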
  11. What is a Graph Database?
      • A graph database is a datastore optimized for storing and querying graph structures
      • Distinct from relational databases
      • Focus in terms of storage and queries is on relationships
  12. Common Use Cases
      Anywhere relationship modeling and analysis can provide insight or value.
  13. Social Media
  14. Master Data Management
  15. Common Use Cases
      • Social Networks
      • Master Data Management
      • Fraud Detection
      • Cybersecurity
      • Identity and Access Management
      • Recommendation Engines
  16. Common Use Cases
      • Social Networks
      • Master Data Management
      • Fraud Detection
      • Cybersecurity
      • Identity and Access Management
      • Recommendation Engines
      Many of these can overlap and be combined to provide new insights.
  17. The Power of Relationships
  18. The Power of Relationships
      • Harness the value of interconnectedness
      • “Paths to Insight”
      • Traversal vs. Traditional Query: Join Reduction
      • “If you can whiteboard it, you can graph it.”
  19. A little history and the importance of OSS licensing.
  20. Titan DB
      • Large scale graph db developed by Aurelius
      • Licensed under ALv2 (this is important)
      • Aurelius acquired by DataStax Feb. 2015
      • 1.0 released Sept. 19, 2015
  21. GitHub Contributions to Titan
      • DataStax Aurelius Acquisition: Feb. 2015
  22. GitHub Contributions to Titan
      • DataStax Aurelius Acquisition: Feb. 2015
      • 0.9.0-M2: Jun. 9, 2015
  23. GitHub Contributions to Titan
      • DataStax Aurelius Acquisition: Feb. 2015
      • 0.9.0-M2: Jun. 9, 2015
      • 1.0: Sept. 19, 2015
  24. GitHub Contributions to Titan
      • DataStax Aurelius Acquisition: Feb. 2015
      • 0.9.0-M2: Jun. 9, 2015
      • 1.0: Sept. 19, 2015
  25. Where does that leave community, users?
  26. ALv2 to the Rescue! Empowering Communities
  27. ALv2 to the Rescue! Empowering Communities
      “We can do this. What’s the next step?”
  28. “Apache Olympian?”
  29. What is a “hostile fork?”
      A "hostile fork" is a fork of a project that goes against the wishes of the copyright holders and/or community.
  30. “DataStax does not approve of and objects to the proposed forking of Titan into Olympian or any other ASF project.”
      – DataStax counsel on Apache Incubator mailing list
  31. “Apache Olympian?”
  32. Next stop…
  33. Introducing…
      • Spearheaded by Google, IBM, Hortonworks, Expero, GRAKN.AI
      • Contributors from Netflix, Amazon, Uber, Orchestral Developments
      • Sponsored by the Linux Foundation
  34. Introducing…
      • ALv2 License
      • Apache style governance model
      • Source code, issues hosted on GitHub
      • Mailing lists on Google Groups
      • Chat on Gitter
  35. Technical Dive
  36. • Optimized for storing/querying billions of vertices and edges
      • Supports thousands of concurrent users
      • Can execute local queries (OLTP) or cross-cluster distributed queries (OLAP)
  37. Apache TinkerPop
      • THE framework and API for graph manipulation and traversal
      • Open source, vendor agnostic
      • Supported by a number of graph DBs
      • Promotes portability
  38. Gremlin Query Language
      • DSL for graph traversal and manipulation
      • Fluent style API
      • Multi-language support (Java, Scala, Groovy, Python, Ruby, etc.)
  39. OLAP Integration
      • Apache Hadoop
      • Apache Spark
      • Apache Giraph
  40. • ACID compliant (depending on backend)
      • Supports very many concurrent transactions
      • Embedded, Single Node, or Scale out
  41. JanusGraph Architectural Overview
  42. Storage Backends
      • Well defined storage API allows for easily pluggable implementations
      • Choose the backend best for your use case and architecture
      • Options include: Apache HBase, Apache Cassandra, Google Cloud Bigtable, Berkeley DB
      • More on the way…
  43. Choose Your Own [CAP] Adventure
      • CAP properties: Consistency, Availability, Partition Tolerance
      • Backends shown: Apache HBase, Berkeley DB, Apache Cassandra, Scylla DB, Google Cloud Bigtable
  44. JanusGraph External Indices
      • Secondary to primary graph storage
      • Provide a means to speed up graph traversal and information retrieval
      • Two types:
        • Graph Index
        • Vertex-centric Index
  45. Graph Indices
      • Global index structures across entire graph
      • Efficient retrieval of vertices and edges based on associated properties
      • Eliminates need to do a full graph scan
      • When querying, JanusGraph will typically warn when a full scan is necessary
      • New indexes take effect immediately, but reindexing may be required
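The difference between a full graph scan and a global index lookup (slide 45) can be sketched in plain Python. This is a toy model under stated assumptions, not JanusGraph's index implementation: the `vertices` dictionary, the function names, and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: vertex id -> properties.
vertices = {
    1: {"name": "hercules"},
    2: {"name": "jupiter"},
    3: {"name": "saturn"},
}


def full_scan(key, value):
    """Without an index, every vertex is inspected: O(|V|) per query."""
    return [vid for vid, props in vertices.items() if props.get(key) == value]


def build_index(key):
    """Build a global index for one property key: value -> list of vertex ids."""
    index = defaultdict(list)
    for vid, props in vertices.items():
        if key in props:
            index[props[key]].append(vid)
    return index


name_index = build_index("name")


def indexed_lookup(value):
    """With the index, the lookup touches only the matching entries."""
    return name_index.get(value, [])


scan_result = full_scan("name", "hercules")
index_result = indexed_lookup("hercules")
```

Both calls return the same vertex ids; the index simply avoids touching every vertex. In this sketch the index covers only the data present when `build_index` ran, which loosely mirrors the slide's note that existing data may need reindexing.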
  46. Vertex-Centric Indexes
      • Local index structures built per-vertex
      • Eliminates the need to load all vertices from the graph for filtering
  47. Pluggable Index Backends
      • Elasticsearch
      • Apache Solr
      • Apache Lucene
  48. Schema and Data Modeling
      • Consists of edge labels, property keys, vertex labels
      • Explicit or implicit
      • Can evolve over time without database downtime
      • Edge label multiplicity, property keys, key cardinality, vertex labels
  49. Schema - Edge Label Multiplicity
      • MULTI: Multiple edges of the same label between any pair of vertices
      • SIMPLE: At most one edge with that label between any pair of vertices
      • MANY2ONE: At most one outgoing edge with that label per vertex (mother/children)
      • ONE2MANY: At most one incoming edge with that label per vertex
      • ONE2ONE: At most one incoming and one outgoing edge with that label per vertex
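The multiplicity rules on slide 49 can be read as constraints checked when an edge is added. The following plain-Python sketch shows one way to enforce them; it is not the JanusGraph API. `EdgeStore`, `try_add`, and the vertex names are hypothetical, and SIMPLE is treated as applying to an ordered (out, in) pair, which is a simplifying assumption.

```python
from enum import Enum


class Multiplicity(Enum):
    MULTI = "MULTI"
    SIMPLE = "SIMPLE"
    MANY2ONE = "MANY2ONE"
    ONE2MANY = "ONE2MANY"
    ONE2ONE = "ONE2ONE"


class EdgeStore:
    """Toy edge store that rejects edges violating a label's multiplicity."""

    def __init__(self, schema):
        self.schema = schema  # edge label -> Multiplicity
        self.edges = []       # list of (label, out_vertex, in_vertex)

    def add_edge(self, label, out_v, in_v):
        rule = self.schema[label]
        same_label = [(o, i) for (l, o, i) in self.edges if l == label]
        has_out = any(o == out_v for o, _ in same_label)
        has_in = any(i == in_v for _, i in same_label)

        if rule is Multiplicity.SIMPLE and (out_v, in_v) in same_label:
            raise ValueError("SIMPLE: edge already exists between these vertices")
        if rule in (Multiplicity.MANY2ONE, Multiplicity.ONE2ONE) and has_out:
            raise ValueError("only one outgoing edge with this label is allowed")
        if rule in (Multiplicity.ONE2MANY, Multiplicity.ONE2ONE) and has_in:
            raise ValueError("only one incoming edge with this label is allowed")
        self.edges.append((label, out_v, in_v))


def try_add(store, label, out_v, in_v):
    """Return True if the edge was accepted, False if a constraint rejected it."""
    try:
        store.add_edge(label, out_v, in_v)
        return True
    except ValueError:
        return False


store = EdgeStore({"mother": Multiplicity.MANY2ONE, "knows": Multiplicity.MULTI})
results = [
    try_add(store, "mother", "child_a", "mom"),    # first mother edge for child_a
    try_add(store, "mother", "child_b", "mom"),    # a mother can have many children
    try_add(store, "mother", "child_a", "other"),  # second mother for child_a: rejected
    try_add(store, "knows", "child_a", "child_b"),
    try_add(store, "knows", "child_a", "child_b"), # MULTI allows duplicates
]
```

Here `results` records which edges were accepted: the MANY2ONE `mother` label permits several children per mother but only one mother per child, while the MULTI `knows` label accepts the duplicate edge.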
  50. Schema - Property Key Data Types
  51. Schema - Property Key Cardinality
      • SINGLE: At most one value per element.
      • LIST: Arbitrary number of values per element. Allows duplicates.
      • SET: Multiple values, but no duplicates.
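The three cardinalities on slide 51 differ only in what happens when a value is added to a key that already has one. A minimal plain-Python sketch follows; it is not the JanusGraph API. The `Element` class and the property keys are hypothetical, and the choice to overwrite on SINGLE is an assumption of this sketch.

```python
class Element:
    """Toy vertex/edge element whose property keys carry a cardinality."""

    def __init__(self, schema):
        self.schema = schema  # property key -> "SINGLE" | "LIST" | "SET"
        self.props = {}       # property key -> list of values

    def add_property(self, key, value):
        cardinality = self.schema[key]
        if cardinality == "SINGLE":
            # At most one value: a new value replaces the old one (assumption).
            self.props[key] = [value]
        elif cardinality == "LIST":
            # Any number of values, duplicates allowed.
            self.props.setdefault(key, []).append(value)
        elif cardinality == "SET":
            # Multiple values, duplicates silently ignored.
            values = self.props.setdefault(key, [])
            if value not in values:
                values.append(value)
        else:
            raise ValueError(f"unknown cardinality: {cardinality}")


element = Element({"name": "SINGLE", "reading": "LIST", "tag": "SET"})
for value in ("hercules", "heracles"):
    element.add_property("name", value)
for value in (1, 1, 2):
    element.add_property("reading", value)
for value in ("hero", "hero", "demigod"):
    element.add_property("tag", value)
```

After these calls, `name` holds only the last value, `reading` keeps all three values including the duplicate, and `tag` keeps two distinct values.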
  52. Graph Traversal with Gremlin
      • Gremlin console:
        • Groovy-based REPL for exploring the graph
        • Pre-defined convenience variables, expandable by plugins. E.g.:
          • “g”: represents the entire graph
          • “hdfs”: access to hdfs provided by the TinkerPop Hadoop plugin
      • Local or remote
  53. Graph Traversal with Gremlin
               \,,,/
               (o o)
      -----oOOo-(3)-oOOo-----
      09:12:24 INFO org.apache.tinkerpop.gremlin.hadoop.structure.H
      plugin activated: tinkerpop.hadoop
      plugin activated: janusgraph.imports
      gremlin>
  54. What path will we be taking today?
      “Graph of the Gods”
  55. Who is Hercules’ grandfather?
  56. gremlin>
  57. gremlin> g
      Global variable representing the entire graph
  58. gremlin> g.V()
      Select all vertices in the graph
  59. gremlin> g.V().has('name', 'hercules')
      Find the vertex that has a ‘name’ property with the value of ‘hercules’
  60. gremlin> g.V().has('name', 'hercules').out('father')
      Follow outbound edge named ‘father’ to the connected vertex
  61. gremlin> g.V().has('name', 'hercules').out('father').out('father')
      Follow outbound edge named ‘father’ to the connected vertex
  62. gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
      Select the vertex property ‘name’
  63. gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
      Select the vertex property ‘name’
  64. gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
      ==> saturn
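To make the step-by-step traversal in slides 57 through 64 concrete, the sketch below re-creates the same chain of steps in plain Python over a three-vertex slice of the "Graph of the Gods". It only mimics the shape of the Gremlin query; it is not TinkerPop or the gremlinpython client. The `Traversal` class and `V` function are hypothetical, and the intermediate `jupiter` vertex is taken from the sample dataset rather than from the slides.

```python
# Vertex id -> properties (a small slice of the "Graph of the Gods").
vertices = {
    1: {"name": "hercules"},
    2: {"name": "jupiter"},   # assumption: Hercules' father in the sample dataset
    3: {"name": "saturn"},
}

# (source vertex id, edge label) -> list of target vertex ids.
out_edges = {
    (1, "father"): [2],
    (2, "father"): [3],
}


class Traversal:
    """Each step takes the current vertex ids and returns a new Traversal."""

    def __init__(self, ids):
        self.ids = list(ids)

    def has(self, key, value):
        # Keep only vertices whose property `key` equals `value`.
        return Traversal(v for v in self.ids if vertices[v].get(key) == value)

    def out(self, label):
        # Follow outbound edges with the given label to the connected vertices.
        return Traversal(t for v in self.ids for t in out_edges.get((v, label), []))

    def values(self, key):
        # Terminal step: read a property from each remaining vertex.
        return [vertices[v][key] for v in self.ids if key in vertices[v]]


def V():
    """Start a traversal from all vertices, like g.V()."""
    return Traversal(vertices)


grandfather = V().has("name", "hercules").out("father").out("father").values("name")
print(grandfather)  # ['saturn']
```

Each method narrows or moves the current set of vertices, which is why the query reads as a path: find Hercules, step to his father, step to that vertex's father, then read the name.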
  65. What’s in a version number?
      • 1.1: Unreleased
      • 0.1.1: May 16, 2017
  66. Contributions Welcome!
      • Website: http://janusgraph.org
      • GitHub Organization: https://github.com/JanusGraph
      • User Mailing List: janusgraph-user@googlegroups.com
      • Developer Mailing List: janusgraph-dev@googlegroups.com
  67. Thank you! Questions?
      P. Taylor Goetz, Hortonworks, @ptgoetz
