© DataStax, All Rights Reserved. Confidential
How Do *You* Do
Graph?
Ben Krug
Technical Support Engineer, DataStax
Who am I?
© 2016 DataStax, All Rights Reserved. 2
A Technical Support Engineer at DataStax.
Previously, Support Engineer at MySQL, then Sun, then Oracle.
Before that, a DBA / sysadmin for banks, utilities, startups, medical and
insurance companies, etc.
Over 25 years in DBMSs, from ISAM and hierarchical to relational,
NoSQL, and graph.
Blogs: formerly oracle2mysql.wordpress.com, now
intertubes.wordpress.com
Disclaimer: Any opinions given are my own!
My topic:
How to look at graphs (the best way?)
● This will be an opinionated discussion!
● Is there a best way?
● We've probably all done a lot of relational - does that help?
Goals:
● Discuss data model (DM) theory (compare and contrast) and some FUD
● Give an overview of some tools, in the context of the discussion (focused
on TinkerPop, Spark, etc., relating to the DSE Graph implementations)
1st, what's a (property) graph?
● A collection of labeled nodes and (directed) edges
● Formally, one example of a definition is:
G = (V, E, λ), where V is a set of vertices, E ⊆ (V × V) is a multi-set of directed binary edges, and
λ : ((V ∪ E) × Σ*) → (U \ (V ∪ E)) is a partial function that maps an element/string pair to an object in the
universal set U (excluding vertices and edges as allowed property values).*
* The Gremlin Graph Traversal Machine and Language, Marko A. Rodriguez, 2015 Proceedings of the ACM Database
Programming Languages Conference
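As a rough illustration of the definition above, here is a minimal Python sketch. Python is used only for illustration, and the class, field, and method names are hypothetical (they come neither from the paper nor from DSE Graph). V is a set, E is a multiset of ordered vertex pairs, and λ is a partial function from (element, key) pairs to property values.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Illustrative G = (V, E, lambda); all names here are hypothetical."""
    vertices: set = field(default_factory=set)        # V
    edges: Counter = field(default_factory=Counter)   # E: multiset of (out, in) pairs
    props: dict = field(default_factory=dict)         # lambda: (element, key) -> value

    def add_vertex(self, v, **properties):
        self.vertices.add(v)
        for key, value in properties.items():
            self.props[(v, key)] = value

    def add_edge(self, out_v, in_v, **properties):
        # An edge must connect vertices that are already in V.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise ValueError("edge endpoints must be vertices in V")
        self.edges[(out_v, in_v)] += 1  # multiset: parallel edges are allowed
        # Simplification: parallel edges share one property map in this sketch.
        for key, value in properties.items():
            self.props[((out_v, in_v), key)] = value

    def prop(self, element, key):
        # lambda is partial: undefined (element, key) pairs return None here.
        return self.props.get((element, key))


g = PropertyGraph()
g.add_vertex("v1", name="gremlin")
g.add_vertex("v2", name="tinkerpop")
g.add_edge("v1", "v2", label="created")
g.add_edge("v1", "v2", label="created")  # a second, parallel edge
```

The `Counter` is what makes E a multiset rather than a set: the same (out, in) pair can be counted more than once.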
By contrast, what's a relational
database?
● A collection of rows and columns, organized into tables?
● wikipedia: a digital database based on the relational model of data, as proposed by E. F. Codd in 1970.
● google dictionary: a database structured to recognize relations among stored items of information.
● Formally, one example of a definition is: ?
● We can base it on the relational algebra, but it's all very wordy and
difficult to pin down concisely.
● Or, we can say an RDBMS is one that adheres to "Codd's 12 rules",
which might mean that none truly exist (see, e.g., rule 6, the "view updating rule")
Let's pretend that we know what we mean - basically, tables of rows and
columns, normalized to some degree, with integrity constraints, "etc".
Importantly, it separates logical view from physical storage.
Things we might hear (that I disagree
with)
● Graph is an entirely new world, wholly distinct and separate from
relational. Relational is just a ball and chain.
"The data explosion demands new solutions, yet the hoary old RDBMS still rules." (InfoWorld)
● Relational is about a static view of bits of data, not relations. (!)
"Joins are bad, mkay?" (https://oracle2mysql.wordpress.com/2016/02/18/joins-are-bad-mkay/)
● "Native" graph databases must be better than "non-native".
(for a fun rebuttal of this, see https://www.datastax.com/dev/blog/a-letter-regarding-native-graph-databases)
● Joins are inherently slower than graph traversals.
this one calls for its own slide...
Are joins slower than traversals? It
depends...
Is O(k) faster than O(log(n)) when k << n? (k is constant) Not always!
E.g., Friends of Friends - say k = average number of friends, n = number of people.
(This is a common example given.)
O(k) = O(1) - it must be fast, right? Not necessarily… For example,
"read an entry from a list of length n":
O(1) algorithm:
read the entire disk, and
return the entry asked for.
O(log(n)) algorithm:
use an index to find the entry,
and return it.
This is not entirely facetious! The clock time depends on the
algorithms that store, read, and deserialize the data, and what data
they need to process in order to find the results.
(eg, see An Evaluation of Alternative Physical Graph Data Designs for Processing Interactive Social Networking Actions, Ghandeharizadeh,
Boghrati, and Barahmand, Database Laboratory Technical Report, Computer Science Department, USC, 2014)
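The "read the entire disk" joke can be put in numbers with a small cost model. The following Python sketch only counts abstract steps, it does not measure real I/O; `DISK_BLOCKS` and both function names are hypothetical choices for illustration.

```python
import math

DISK_BLOCKS = 1_000_000  # hypothetical fixed disk size; does not depend on n


def full_scan_cost(n):
    """'O(1)' in n: one pass over the whole disk, whatever n is."""
    return DISK_BLOCKS


def index_lookup_cost(sorted_keys, key):
    """O(log n): binary search, counting the comparisons it makes."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps


keys = list(range(100_000))  # n = 100,000 entries
indexed = index_lookup_cost(keys, 73_421)
scanned = full_scan_cost(len(keys))
print(f"log2(n) is about {math.log2(len(keys)):.1f}")
print(f"index lookup: {indexed} steps, full scan: {scanned} steps")
```

The constant-cost algorithm never gets slower as n grows, yet in this model it loses to the logarithmic one for any realistic n, which is the point of the slide: the asymptotic class alone does not tell you the clock time.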
Examples from a graph company's site
This all comes from the first page that came up (paid) when I googled "index free adjacency". Or, it's the first link if you google "graph
databases future".
"Some graph databases use native graph storage that is specifically designed to store and manage
graphs – from bare metal on up. Other graph technologies use relational, columnar or object-oriented
databases as their storage layer. Non-native storage is often slower than a native approach
because all of the graph connections have to be translated into a different data model."
This, we've already discussed - what is "native" graph storage? What guarantees that it's faster?
"Native graph processing (a.k.a. index-free adjacency) is the most efficient means of processing data
in a graph because connected nodes physically point to each other in the database. Non-native
graph processing engines use other means to process Create, Read, Update or Delete (CRUD)
operations that aren’t optimized for handling connected data."
As you scale to multiple nodes, "physically pointing" becomes meaningless. And again, what guarantees that
it's faster? And the last sentence is just FUD: any mature database, RDBMSs included, is heavily optimized
for CRUD. (For a nice rebuttal, focusing on a single-server implementation, see
https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/ .)
Examples from a graph company's site
(cont)
"With traditional databases, relationship queries come to a grinding halt as the number and depth of
relationships increase. In contrast, graph database performance stays constant even as your data
grows year over year."
- Will traversals stay constant as edges increase? What about queries that span the graph?
- For one example of numbers, see https://intertubes.wordpress.com/2017/11/28/benchmarketing-neo4j-and-mysql/
"With graph databases, your IT and data architecture teams move at the speed of business because
the structure and schema of a graph data model flex as your solutions and industry change. Your
team doesn’t have to exhaustively model your domain ahead of time (and then exhaustively remodel
and migrate the DB after some exec asks for a change)."
This overstates the impacts of a schema change in a well-designed system.
"your application doesn’t have to infer data connections using things like foreign keys or out-of-band
processing, like MapReduce."
Edges are not so different from foreign keys; is it really so difficult to "infer"? What does "out-of-band" mean?
Now, on to the reality (as I see it)
● For data models, there's often a mapping between graph DBs and
relational DBs
It's hard to formalize, because relational is hard to formalize.
A picture, with a common example of a graph DB:
● How about integrity constraints?
○ attributes must be of the types specified in the schema
○ FKs (when not null) must reference a key in the parent table; an edge must
point to a vertex.
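The parallel between the two integrity rules can be sketched with Python's standard `sqlite3` module. The schema below is a hypothetical mapping chosen for illustration (it is not how DSE Graph or SparkSQL store a graph): vertices become rows in `person`, edges become rows in `knows`, and the foreign keys enforce "an edge must point to a vertex".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical schema: vertices are rows, edges are rows carrying two FKs.
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE knows (
        out_v INTEGER NOT NULL REFERENCES person(id),
        in_v  INTEGER NOT NULL REFERENCES person(id)
    );
""")
conn.execute("INSERT INTO person VALUES (1, 'alice'), (2, 'bob')")
conn.execute("INSERT INTO knows VALUES (1, 2)")  # edge alice -> bob

try:
    conn.execute("INSERT INTO knows VALUES (1, 99)")  # 99 is not a vertex
    dangling_edge_rejected = False
except sqlite3.IntegrityError:
    dangling_edge_rejected = True

edge_count = conn.execute("SELECT COUNT(*) FROM knows").fetchone()[0]
print(dangling_edge_rejected, edge_count)
```

The dangling edge is refused by the same mechanism that refuses an orphaned child row, which is the sense in which the two constraints line up.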
As a result, many ways to view a graph
● A graph view (gremlin)
● A relational view (SparkSQL, dseGraphFrames, Studio (JDBC/ODBC))
● A graphical graph view (Studio)
● sometimes, the underlying storage view (in DSE, CQL (for 7eet h4x0rs))
How about data access methods?
● It's often said that SQL/CQL are declarative
● gremlin offers an imperative access method
● Is declarative for relational, and imperative for graphs?
● It's not so simple...
CQL is declarative; is SQL declarative?
● A lot of SQL is declarative. Declarative is "relational", especially in early SQL.
● SQL '99 offers CTEs, which have an imperative element.
● Many people use UDFs with SQL/CQL, or embed SQL/CQL in imperative
(procedural) code.
● Most SQL vendors offer extensions, like PL/SQL or T-SQL, with
imperative (procedural) elements.
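One way to see the imperative flavor of CTEs is a recursive CTE, which reads much like a loop: a starting row, then a step that is repeated until nothing new is produced. The sketch below runs one through Python's standard `sqlite3` module; the `reports_to` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports_to (employee TEXT, manager TEXT);
    INSERT INTO reports_to VALUES
        ('dave', 'carol'), ('carol', 'bob'), ('bob', 'alice');
""")

# Walk up the management chain from 'dave', one hop per recursion step.
rows = conn.execute("""
    WITH RECURSIVE chain(name, depth) AS (
        SELECT manager, 1 FROM reports_to WHERE employee = 'dave'
        UNION ALL
        SELECT r.manager, chain.depth + 1
        FROM reports_to r JOIN chain ON r.employee = chain.name
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()
print(rows)
```

The query is still a single declarative statement, but its meaning is easiest to explain as "start here, then keep hopping", which is also how one would describe a graph traversal.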
Even declarative SQL can be
imperative-ish
● Consider an equality lookup, joined to a child, joined to a child. (Eg,
Friends of Friends - a "graphy" query.)
● We know this will be implemented in one way - look up the key, go look
up the child, then look at its child (like a traversal).
● Does that make declarative SQL imperative?
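The Friends of Friends query above can be written both ways and compared. This is a small Python sketch using the standard `sqlite3` module; the `friend` table, its index, and the sample names are hypothetical. The SQL states what is wanted, while the loop spells out the hops the way a traversal would.

```python
import sqlite3

# Hypothetical directed "is a friend of" pairs.
friendships = [("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
               ("cat", "dan"), ("cat", "eve")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friend (person TEXT, friend TEXT)")
conn.execute("CREATE INDEX friend_person ON friend (person)")
conn.executemany("INSERT INTO friend VALUES (?, ?)", friendships)

# Declarative: an equality lookup joined to a child table.
sql_result = {row[0] for row in conn.execute("""
    SELECT DISTINCT f2.friend
    FROM friend f1 JOIN friend f2 ON f2.person = f1.friend
    WHERE f1.person = 'ann'
""")}

# Imperative: look up the key, then follow each friend to their friends.
adjacency = {}
for person, friend in friendships:
    adjacency.setdefault(person, []).append(friend)
walk_result = {fof
               for f in adjacency.get("ann", [])
               for fof in adjacency.get(f, [])}

print(sql_result == walk_result, sorted(sql_result))
```

Both forms return the same set; the difference is who decides the order of the lookups, the query planner or the person writing the loop.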
gremlin is declarative and imperative
● An example: managers of "gremlin's" collaborators:
declarative:
g.V().match(
as("a").has("name","gremlin"),
as("a").out("created").as("b"),
as("b").in("created").as("c"),
as("c").in("manages").as("d"),
where("a",neq("c"))).
select("d").
groupCount().by("name")
imperative:
g.V().has("name","gremlin").as("a").
out("created").in("created").
where(neq("a")).
in("manages").
groupCount().by("name")
taken from https://tinkerpop.apache.org/gremlin.html
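To make the step-by-step reading of the imperative form concrete, here is a plain-Python emulation over a tiny in-memory graph. It is not Gremlin, and it says nothing about how TinkerPop actually executes a traversal; the sample data and the function name are hypothetical. Each loop is annotated with the Gremlin step it stands in for.

```python
from collections import Counter

# Hypothetical data: creator -> projects, manager -> direct reports.
created = {"gremlin": ["proj"], "dev1": ["proj"], "dev2": ["proj"]}
manages = {"boss1": ["dev1", "dev2"], "boss2": ["dev2"]}


def managers_of_collaborators(start):
    counts = Counter()
    for project in created.get(start, []):               # out("created")
        for person, projects in created.items():         # in("created")
            if project in projects and person != start:  # where(neq("a"))
                for boss, reports in manages.items():     # in("manages")
                    if person in reports:
                        counts[boss] += 1                 # groupCount().by("name")
    return counts


result = managers_of_collaborators("gremlin")
print(result)
```

Every path from the start vertex to a manager adds one to that manager's count, which mirrors how `groupCount()` counts traversers rather than distinct vertices.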
comparing: SQL and gremlin
There's a nice site comparing SQL and gremlin for a graph database with a
schema: sql2gremlin.com. Some examples:
SQL:
SELECT DISTINCT LEN(CategoryName)
FROM Categories
gremlin:
g.V().hasLabel("category").values("name").
map {it.get().length()}.dedup()
SQL:
SELECT ProductName, UnitPrice
FROM Products
WHERE UnitPrice >= 5 AND UnitPrice < 10
gremlin:
g.V().has("product", "unitPrice",
between(5f,10f)).valueMap("name", "unitPrice")
SQL:
SELECT Products.ProductName
FROM Products
INNER JOIN Categories
ON Categories.CategoryID = Products.CategoryID
WHERE Categories.CategoryName = 'Beverages'
gremlin:
g.V().has("name","Beverages").in("inCategory").
values("name")
SparkSQL is declarative and imperative
● The data model currently exposed is a bit hard to work with and very unnormalized:
one vertices table and one edges table
● e.g. (using Studio), to find the killrvideo movies Daniel Day-Lewis acted in:
select movie.title
from killrvideo_vertices actor
join killrvideo_edges acts_in on acts_in.dst = actor.id
join killrvideo_vertices movie on acts_in.src = movie.id
where actor.name = 'Daniel Day-Lewis';
● imperative too, because you can use Spark functionality
"relational" - DseGraphFrames
● DseGraphFrames are an extension of Spark GraphFrames
○ GraphFrames are built on DataFrames. (DataFrames add a bit of
relational ease to Spark.)
● Some simple examples (from our docs and Databricks docs):
g.E().groupCount().by(T.label)
g.V().has("age", P.gt(30)).show
g.V().hasLabel("person").drop().iterate()
● DseGraphFrames support GraphX (and other) Spark libraries
○ GraphX has functions for PageRank, connected components,
shortest path, etc
You can also use CQL to peek under
the hood
● However, this is generally not so useful
cassandra@cqlsh:killrvideo> select * from person_p where name = 'Daniel Day-Lewis' allow filtering;
 community_id | member_id | ~~property_key_id | ~~property_id                        | name             | personId | ~~vertex_exists
--------------+-----------+-------------------+--------------------------------------+------------------+----------+-----------------
   1916035712 |        76 |             32771 | 00000000-0000-8003-0000-000000000000 | Daniel Day-Lewis |     null |            null
…
cassandra@cqlsh:killrvideo> select * from person_e where community_id = 1916035712 and member_id = 76;
 community_id | member_id | ~~edge_label_id | ~~adjacent_vertex_id       | ~~adjacent_label_id | ~~edge_id                            | ~~edge_exists | ~~simple_edge_id
--------------+-----------+-----------------+----------------------------+---------------------+--------------------------------------+---------------+------------------
   1916035712 |        76 |           65571 | 0x1121e2800000000000000205 |                   4 | a018fe3c-b87f-11e8-9882-f533cf6b1c0d |          True |             null
...
This is an implementation (storage) detail, not a logical view!
My vision - all-in-one
● Seamlessly conceptualize and access your graph in many ways
○ declarative, imperative, graphy, "relational" - avoid FUD!
● We're going in that direction!
● Use the best tools for the job at hand.
How does DSE integrate all this and help?
● DSE Graph implements Tinkerpop, gremlin, DseGraphFrames, and more
(eg, Solr integration).
● gremlin gives you powerful imperative and declarative methods, for
traversals and graph analyses.
● SparkSQL/DseGraphFrames do too, with the power of Spark.
● gremlin language variants (GLVs) allow you to program in your favorite
(supported) language.
● Solr can be integrated, which adds a declarative element to narrow
queries.
● gremlin OLAP uses SparkSQL to speed complex queries.
● DSE Studio offers graphical views and gremlin and SparkSQL access.
How Do *You* Do Graph?
Ben Krug
Technical Support Engineer, DataStax
ben.krug@datastax.com

Editor's Notes

  • #4 Interest in the theory and history of data models. Do we need to throw out the baby with the bathwater? How many have some experience with, or knowledge of, Tinkerpop (especially gremlin)?
  • #5 It's a bit philosophical. I want to convince you that graph and relational are not as different as you think they are. They are different, but not *as* different.
  • #6 I'm not experienced in semantic or knowledge graphs; they may be less amenable to the coming ideas. In mathematics, a multiset (aka bag or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances of each of its elements.
  • #7 Also, I think relational algebra would be more about access methods than the data model. Rule 6: All views that are theoretically updatable are also updatable by the system.
  • #8 Note: this means that I'm calling Cassandra w/CQL "relational". It's a very loose use of the word. Now, I want to clear the air a bit about "relational vs graph". If we're going to consider graph DBs, we need to honestly look at their features and uses (and advantages), not just say "relational bad, graph good".
  • #9 The quote is just an example I found on a first-try Google search. Not the worst!
  • #11 Native vs non-native, and Marko's paper: "native storage of a graph". How do you serialize a graph? (My problem for this talk: how to serialize all these related ideas!)
  • #13 Also (more importantly?), both can expand exponentially as you repeat, as in Denise's talk.
  • #14 In reality, look at each implementation, and its performance on the kinds of accesses you'll need to do.
  • #15 Schema changes also depend on the implementation. Graphs can have schemas, and this was a question after Denise's talk.
  • #17 In particular, for property graphs that have some type of schema.
  • #19 Note: this is not the way we, or SparkSQL, map the graph to tables! The idea is that they are not separate universes.
  • #20 FKs: otoh, there's no system-enforced "on delete cascade". Where do these similarities lead us?...
  • #22 What kind of languages should we use? Is there a "best"? One note: declarative languages are nice for optimizers, imperative ones are nice if you know best; gremlin > SQL, but relational languages were designed to be declarative in order to enforce and protect data integrity, not for lack of imagination.
  • #23 CQL, Cassandra (NoSQL), relational: "just when I thought I was out, they pull me back in!"
  • #24 The point is not that it's truly imperative, but that imperative traversals and declarative ones, in themselves, may not be so different, as far as "imperative vs declarative" goes. Next: gremlin.
  • #25 This is a bit like the "imperative" SQL traversal query mentioned: both do the same thing, and may well take the same steps. Next: SparkSQL.
  • #26 Couldn't do gremlin2sql w/all of gremlin. The site uses T-SQL, so possibly you could do gremlin2TSQL? The point is to compare, and also, if you're used to relational, this can help you get started with gremlin queries. (Or, if you are, also see https://academy.datastax.com/content/gremlin-traversals - google "gremlin recipes datastax")
  • #27 Tradeoffs: knowledge/semantic graph example (no schema). BUT, you get the power of Spark for analyzing pieces of the graph. Studio is a great and easy tool to do a lot of these things … discuss its options, schema views (hidden tables in CQL view), etc.
  • #28 I say "relational" in that this is a declarative element in Spark, as if you're dealing with a vertices table and an edges table. You can also combine graph and non-graph data using SparkSQL, Spark, and DseGraphFrames. DataFrames are useful for loading data into a graph.
  • #29 Nodes get _p tables, edges get _e tables. Note: this is like x$ tables in Oracle - not documented, subject to change, etc.
  • #30 Bicycle and car example: one day, trying to adjust seat height (wrong spot), then also disconnecting battery (engine light).