Data Structure Graph DMZ #DMZone

Data Structure Graphs
An overview
Presentation by @dougneedham

Introduction
 @dougneedham
 Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now Data
Scientist.
 Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
 I have a strong relational/traditional background.
 Perpetual Student
 Learning new things challenges our assumptions. Forces us to take a new perspective
on “old” problems. Eventually maybe even shows us that there is a better way to solve
a problem.

Introducing Data Structure Graphs
 Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity Relationship
Diagram (ERD): tables are vertices, foreign keys are edges.
 A DSG-L1 can show you where you are going to have the most interesting query
performance of your tables.
 Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an application.
Each Edge is data transfer. Roughly equivalent to what we used to call Data Flow
diagrams.
 A DSG-L2 can show you where the most work is going on in your
Enterprise.
 Data Structure Graph Dependency (DSG-D) – Each vertex is a job, script, program, or
process that is dependent on something happening in sequence before it can do its
work.
 A DSG-D can show you the sequence of events that need to take place in order for
something to be completed.

Definition
 A Data Structure Graph is a group of atomic entities that are related to each other,
stored in a repository, then moved from one persistence layer to another, rendered as
a Graph.
 A group of atomic entities.
 Related to each other.
 Stored in a repository.
 Moved from one persistence layer to another.
 Rendered as a Graph.

In summary: Social Network analysis applied
to data modeling.
 Data modeling is a topic we are all familiar with here at data modeling zone.
 Social Network analysis is, perhaps, something new.
 So a little background on the topic we may not be familiar with.

What is Social Network Analysis?
 “Social network analysis (SNA) is a strategy for investigating social structures through the use of network and
graph theories.
 It characterizes networked structures in terms of nodes (individual actors, people, or things within the network)
and the ties or edges (relationships or interactions) that connect them.
 Examples of social structures commonly visualized through social network analysis include
 social media networks,
 friendship and
 acquaintance networks,
 kinship,
 disease transmission, and
 sexual relationships.
 These networks are often visualized through sociograms in which nodes are represented as points and ties are represented
as lines.” – Wikipedia
 https://en.wikipedia.org/wiki/Social_network_analysis

Example From wiki:
"Kencf0618FacebookNetwork" by Kencf0618 -
Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg

A little History
 The 7 Bridges of Konigsberg
 Every tome on Graph theory or Network analysis devotes a small portion of its time
to the 7 Bridges of Konigsberg.
 If I don’t cover this with you, the gods of mathematics will strike me down, and never
allow me to do analysis again in the future.

The Problem
 Folks enjoyed their Sunday afternoon strolls across the bridges, but occasionally people
would wonder whether there was a route that crossed each of the seven bridges exactly once.
 Eventually Leonhard Euler was brought into the debate about the problem.
 Euler used Vertices to represent the land masses and edges (or arcs, at the time) to
represent bridges. He realized the odd number of edges per vertex made the problem
unsolvable.
 Sarada Herke provides one of the best explanations of the solution: Solution to
Konigsberg
 Basically the solution is that every vertex must have an even number of edges in order to make
it possible to start from one vertex, cross every edge exactly once, and arrive back at the point of
origin. Essentially, the number of bridges touching each land mass must be even. (more details in
the above video)
 And here is the cool thing about mathematicians. If we tell you something is impossible, we
have to tell you why in a way you can understand it. In doing so, Euler also invented the branch of
mathematics we today call Graph Theory.
 http://en.wikipedia.org/wiki/Leonhard_Euler
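
The even-degree rule can be checked mechanically. The following is a minimal sketch, not Euler's own notation: the land-mass names are labels chosen for this example, and the graph is assumed to be connected.

```python
from collections import Counter

# The seven bridges as an edge list between four land masses.
# The land-mass names are labels chosen for this sketch.
BRIDGES = [
    ("Kneiphof", "North bank"), ("Kneiphof", "North bank"),
    ("Kneiphof", "South bank"), ("Kneiphof", "South bank"),
    ("Kneiphof", "Lomse"),
    ("North bank", "Lomse"),
    ("South bank", "Lomse"),
]


def degrees(edges):
    """Count how many edge ends touch each vertex."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return dict(count)


def has_closed_euler_walk(edges):
    """True when every vertex has even degree (graph assumed connected)."""
    return all(d % 2 == 0 for d in degrees(edges).values())


konigsberg_degrees = degrees(BRIDGES)
walk_possible = has_closed_euler_walk(BRIDGES)
```

Every land mass ends up with an odd number of bridges, so `walk_possible` is `False`.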

A few terms
 Stand back, we are going to talk about math!
 Basically we are talking about a bunch of dots joined together by lines
 Vertex – Dot on a graph
 Edge – Line connecting two vertices
 Edge_Label – this is a term I coined originally related to Data Structure Graphs that helps trace a
path. If you label your edges, and you have multiple edges with the same label in a Graph you can
quite easily identify walks, paths, and cycles through your graph.
 A lot of things are networks if you look at them the right way.
 Mark Newman has done a number of really cool presentations, available on YouTube, about Network
analysis.
 https://www.youtube.com/watch?v=lETt7IcDWLI

More terms
 What is a path?
 Shortest path – How are two vertices connected?
 Longest Path – Tracing the flow of an interesting item through a large collection of
applications.
 Directed Graphs – or Digraphs
 If you rearrange things how does the layout affect understanding?
 This is not just data visualization, it can also be used for prediction.
https://www.youtube.com/watch?v=rwA-y-XwjuU

Final terms
 Centrality – Hub and Authority
 This is almost a whole topic by itself, since there are different types of Centrality:
 Degree Centrality, Eigenvector Centrality, PageRank, etc…
 Power law.
 Transitivity
 Homophily – how similar things tend to connect to each other
 Contagion – How do things “spread” through a network?
 Order of a graph – number of vertices
 Size of the graph – number of edges

The Math doesn’t change.
 One thing I like about Graphs –
 The Math does not change.
 The math behind Graph theory can be a little intense, but it does not change
regardless of the scale of the graph.
 Once you understand how to “do the math” on a small graph, the same math
applies whether it is a graph of the people in this room, or a graph of the
people on this planet.

Before we get to the analysis we must collect data.
 Dbeaver can reverse engineer an ERD.
 Point it at the source system, select a few options, then you have a diagram.
 I wrote a small piece of Python code to translate the XML to a file suitable for import into Gephi.
 One small caveat: the Foreign keys have to be defined for Dbeaver to work. If the foreign keys are not
defined the output file will need to be modified.
 Also, some aggregate or summary tables may not help your visualization.
 This is subjective, so it is at the discretion of the person reviewing the diagram.
 If you remove tables from the graph, please provide documentation such that the visualization can be
compared to the reality of your data model with no discrepancies.
 Url for Dbeaver is here: https://dbeaver.jkiss.org/
 (This section is a little hand-wavy I know but the tool, or method for creating the file for import into
Gephi is largely irrelevant.)

Gephi
 http://gephi.github.io/
 From the website: “Gephi is an interactive visualization and exploration platform for all
kinds of networks and complex systems, dynamic and hierarchical graphs.”
 We are going to use data generated from my book: Data Structure Graphs.
 These are inspired by my experience consulting, but do not represent an actual data
model, or ETL process.
 The following slides are for a DSG Level 2 (ETL process).

New Project, Data Table, Import data.

Load as “Edges Table” Source, Target (required)
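
An edges table for this import step could be produced along the following lines. This is a sketch under assumptions: the application names and Edge_Labels in `TRANSFERS` are hypothetical, and the extra `Label` column is optional (only `Source` and `Target` are required).

```python
import csv
import io

# Hypothetical data transfers: (sending application, receiving application, Edge_Label).
TRANSFERS = [
    ("CRM", "Staging", "customer"),
    ("Billing", "Staging", "invoice"),
    ("Staging", "Warehouse", "customer"),
    ("Staging", "Warehouse", "invoice"),
]


def to_gephi_edges_csv(transfers):
    """Return CSV text with the Source and Target columns an edges table needs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Label"])
    for source, target, label in transfers:
        writer.writerow([source, target, label])
    return buffer.getvalue()


edges_csv = to_gephi_edges_csv(TRANSFERS)
```

Writing `edges_csv` to a file gives something that can be loaded through the Data Table import as an edges table.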

After a few calculations and layout runs

PageRank – Which application is most important?

Where is that Node with the highest PageRank?
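
For readers who want to see what the PageRank number means, here is a small power-iteration sketch. It is an illustration, not Gephi's implementation: the application names are hypothetical, and the damping factor of 0.85 and the even redistribution of rank from nodes without outgoing edges are assumptions of this example.

```python
# Hypothetical data transfers between applications (source, target).
EDGES = [
    ("CRM", "Staging"),
    ("Billing", "Staging"),
    ("Staging", "Warehouse"),
    ("Warehouse", "Reporting"),
]


def pagerank(edges, damping=0.85, iterations=100):
    """Approximate PageRank by repeated redistribution of rank along edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared evenly by all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


ranks = pagerank(EDGES)
top_application = max(ranks, key=ranks.get)
```

In this toy chain, rank accumulates downstream, so the application at the end of the flow scores highest.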

Now things get interesting:
 New metrics for our data model follow.
 Remember all those metrics we defined earlier?
 Here are many of them:

Finally, here we are.
 Within a Data Architecture there are lots of moving pieces. ETL, FTP, SFTP, Web-
Services, External data feeds. Data moving into Data Marts, and Data Warehouses.
Data Moving between applications.
 Let’s imagine how to visualize this using the information we just gained.

Data Structure Graphs
 Today, there are a few tools like ERWin, and SQL Developer that begin to organize
visualizations in this manner.
 Very few of them allow you to perform analysis on the visualization.
 As you find new tools that do this, please let me know.
 I would love to evaluate those tools and see what interesting metrics can be arrived at
from new tools.

Dijkstra's algorithm
 Some of you may have heard of Dijkstra’s algorithm.
 It is a method for finding the shortest path between two nodes on a Graph.
 This is a great optimization technique, but what if you need to find the longest path?
 What “Edge_Label” has the most influence on my organization?
 Iterate through each Edge_Label, create a subgraph that consists of only the nodes
this Edge_Label touches, then calculate the diameter of that Graph.
 The Edge_Label whose subgraph has the largest diameter has the most “impact” on your organization.
 This is mostly applied to Data Structure Graph Level 2.
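
The Edge_Label procedure above can be sketched as follows. The edge list is hypothetical, edge direction is ignored when measuring distance, and because every edge counts as one hop a breadth-first search stands in for Dijkstra's algorithm; all three are assumptions of this example.

```python
from collections import defaultdict, deque

# Hypothetical DSG-L2 edges: (source application, target application, Edge_Label).
EDGES = [
    ("CRM", "Staging", "customer"),
    ("Staging", "Warehouse", "customer"),
    ("Warehouse", "Reporting", "customer"),
    ("Billing", "Staging", "invoice"),
    ("Staging", "Warehouse", "invoice"),
]


def diameter(pairs):
    """Longest shortest path, in hops, between any two connected vertices."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    longest = 0
    for start in neighbours:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(distance.values()))
    return longest


def diameter_by_label(edges):
    """Build one subgraph per Edge_Label and measure its diameter."""
    by_label = defaultdict(list)
    for source, target, label in edges:
        by_label[label].append((source, target))
    return {label: diameter(pairs) for label, pairs in by_label.items()}


diameters = diameter_by_label(EDGES)
most_impactful = max(diameters, key=diameters.get)
```

Here the "customer" entity travels through more applications than "invoice", so it comes out as the label with the largest diameter.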

Now let’s answer some questions.
 Which tables are “most important” to import when building a data warehouse?
 The tables with the higher centrality measures.
 For an operational system these will also be the tables that have the most queries written against them.
 These will be your bottlenecks for any system.
 Is this data model optimized for reading or writing?
 What is the density of the data model?
 A higher-density model is optimized for writes; a lower-density model is optimized for reads.
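
Both measures are easy to compute from a foreign-key list. A minimal sketch, assuming a hypothetical set of tables, the directed-graph definition of density (edges divided by n(n-1)), and plain degree centrality as the centrality measure:

```python
# Hypothetical foreign keys: (child table, parent table).
FOREIGN_KEYS = [
    ("order_line", "orders"),
    ("order_line", "product"),
    ("orders", "customer"),
    ("payment", "orders"),
]


def density(edges):
    """Edges present divided by the edges a directed graph could hold."""
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    return len(edges) / (n * (n - 1))


def degree_centrality(edges):
    """Each table's edge count divided by the number of other tables."""
    nodes = {node for edge in edges for node in edge}
    degree = dict.fromkeys(nodes, 0)
    for child, parent in edges:
        degree[child] += 1
        degree[parent] += 1
    return {table: d / (len(nodes) - 1) for table, d in degree.items()}


model_density = density(FOREIGN_KEYS)
centrality = degree_centrality(FOREIGN_KEYS)
most_central = max(centrality, key=centrality.get)
```

In this toy model `orders` is the most central table, which matches the intuition that it is the one most queries would touch.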

Barabasi-Albert model and Scale free networks.
 Preferential attachment.
 There are a few different models available for analysis and prediction of networks.
 A Barabási-Albert model can be summarized as a “rich get richer” model. In other words, when new
nodes are added, they are more likely to connect to nodes that are already well
connected.
 This sounds suspiciously similar to our data modeling concepts related to conformed dimensions.
 My suspicion is there are many data models that fit this model.
 Please send me some anonymized data models. I want to research this more.
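
Preferential attachment is simple to simulate. The sketch below is a simplified version of the model in which each new node adds exactly one edge; the node count and random seed are arbitrary choices for the example.

```python
import random
from collections import Counter


def preferential_attachment(total_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per edge end, so a uniform draw
    # from it is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new_node in range(2, total_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


def degree_counts(edges):
    """Number of edges touching each node."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count


graph_edges = preferential_attachment(200)
node_degrees = degree_counts(graph_edges)
```

Sorting `node_degrees` by value typically shows a handful of heavily connected hubs and a long tail of nodes with a single edge, which is the pattern to look for in a data model built around conformed dimensions.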

Some theoretical thoughts.
 Let’s assume we have an equation for the growth of every table we have collected
from our little topological study above (more on this in a couple slides).
 Let us further assume we have a graph of the same tables.
 Can you do anything interesting with this?
 The derivative of each equation shows us the growth rate of the table.
 What happens if we plug that derivative in the entropy equation for the graph?
 What would this represent?
 Could this be considered a valuation method?
 A way to put a dollar value on a data model?
 If you try it, let me know what you find out.

Apply the theory.
 Using a few metrics from each table we can do some clustering.
 Take the number of columns of a table, the centrality measure, and the growth rate, and
you have a vector for each table.
 Doing some simple cosine similarity on these vectors will tell you mathematically which
tables are similar.
 Is this finding consistent with expectations?
 If not should the model be adjusted?
 What does this result say to you?
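
A cosine similarity over such vectors might look like this. The table names and the numbers in `TABLE_VECTORS` are hypothetical, made up purely to illustrate the calculation.

```python
import math

# Hypothetical vectors: (number of columns, centrality, growth rate in rows per day).
TABLE_VECTORS = {
    "orders": (12, 0.75, 900.0),
    "order_line": (8, 0.50, 2700.0),
    "customer": (20, 0.25, 40.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


orders_vs_lines = cosine_similarity(TABLE_VECTORS["orders"], TABLE_VECTORS["order_line"])
orders_vs_customer = cosine_similarity(TABLE_VECTORS["orders"], TABLE_VECTORS["customer"])
```

One design choice to be aware of: the three components are on very different scales, so the growth rate dominates the result here. Rescaling each component before comparing is a reasonable option, though that is a choice of this sketch rather than something the slides prescribe.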

Deriving the growth rate of each table.
 Little R demonstration to follow.
 A design methodology like the data vault mandates that every table have date timestamps for when the
data is loaded.
 Collect how many records are loaded per day.
 A calculation that represents the growth formula for each table can be derived with R.
 Using the growth rate, centrality, and the width of a table (number of columns) you can do cosine similarity
to determine the tables that are mathematically similar to each other.
 Using this information you may be able to reallocate the infrastructure that the data warehouse sits on.
 Is every table stored on the same disk storage media? Does it need to be?
 How about caching? Using these metrics alone you can make a well informed decision about your storage
platform.
 The following image is a small topological representation of this process.
 This is still slightly theoretical, and I welcome having a conversation with anyone that may want to know
more.
 Again, send me anonymized data. Hopefully along with the Data Structure Graph you generated from your
data.
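
The talk derives the growth formula in R. As a rough Python equivalent, the sketch below fits a straight line to the cumulative row count and takes its slope as the growth rate. The linear model is an assumption of this example, and at least two days of load counts are needed.

```python
def growth_rate(daily_loads):
    """Least-squares slope of cumulative rows against day index (rows per day)."""
    cumulative = []
    total = 0
    for rows in daily_loads:
        total += rows
        cumulative.append(total)
    n = len(cumulative)
    mean_x = (n - 1) / 2
    mean_y = sum(cumulative) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(cumulative))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical counts of records loaded per day, taken from load timestamps.
steady_table_rate = growth_rate([100, 100, 100, 100, 100])
```

For a linear fit the derivative is just the slope, so this one number can serve as the growth-rate component of each table's vector.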

This is what the topology may look like.

Consider the following:
 If you need assistance, contact me directly (I am easy to find @dougneedham)
 Network/Graph Analysis is cool.
 It can show you some interesting things about your data that you may not have
considered.

What did I leave out?
 Graphs that change over time – What happens when you remove a single Edge or
Vertex?
 Comparing two networks – If you have the same number of edges and nodes, are two
graphs the same?
 Contagion – How will data spread through the network? (Since a DSG represents
different types of Edges based on Edge_Label, Contagion should not affect the entire
network.) This is also commonly known as data lineage. If you don’t have a tool that
does it, with a bit of metadata management this can be derived from a Data Structure
Graph Level 2.

Other Analysis
 What else can be done with Social Network Analysis?
 How about risk exposure to banks?
 http://www.federalreserve.gov/newsevents/speech/yellen20130104a.htm

One other cool bit of Math
 How many reports can your dimensional data model support?
 Do you have the situation where people want to create a project out of a report,
rather than do a proper data model design up front?
 Here is some help.
 The upper bound of the total number of reports that a conformed dimension data
model can support is calculated by:
 Calculate the number of selectable columns in each dimension ($2^c - 1$)
 Create the adjacency matrix for the dimensions to facts
 A bit of multiplication.
 More details here: http://bit.ly/MeasuringDimensionalModels
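
One plausible reading of these steps is sketched below; the linked write-up may combine the numbers differently. The dimension names, column counts, and fact-to-dimension adjacency are hypothetical, and the rule used here (multiply the column choices of the dimensions each fact joins to, then add up the facts) is an assumption of this example.

```python
# Hypothetical column counts per dimension.
DIMENSION_COLUMNS = {"date": 3, "customer": 4, "product": 2}

# Hypothetical adjacency: which dimensions each fact table joins to.
ADJACENCY = {
    "sales": ["date", "customer", "product"],
    "returns": ["date", "product"],
}


def column_choices(column_count):
    """Non-empty subsets of a dimension's columns: 2^c - 1."""
    return 2 ** column_count - 1


def report_upper_bound(dimension_columns, adjacency):
    """Assumed rule: per fact, multiply the choices of its dimensions; sum over facts."""
    total = 0
    for dimensions in adjacency.values():
        combinations = 1
        for dimension in dimensions:
            combinations *= column_choices(dimension_columns[dimension])
        total += combinations
    return total


upper_bound = report_upper_bound(DIMENSION_COLUMNS, ADJACENCY)
```

With these made-up numbers, `sales` contributes 7 × 15 × 3 combinations and `returns` contributes 7 × 3.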

Graphs are Cool!
 Help me.
 Please send me anonymized data.
 In order to present more about how the mathematics of Graph theory and social
network analysis can be applied to data modeling, I need
more data.
 This is a fascinating topic. If you want to reach out to me directly, I can be reached at:
dougthedataguy@gmail.com
 Here is my GitHub for the code and data from the book, and examples:
http://bit.ly/DataStructureGraph_github

https://dougneedham.shinyapps.io/DataStructureGraph
Hard to see, I know, but the top diagram is the “master graph”, the bottom image is a single Edge_Label. You can see how an individual
data entity flows through an organization.

My book
Goes through a number of examples for doing a Graph analysis of a fictional organization.
