Presentation given at the Data Modeling Zone (DMZ) conference about Data Structure Graphs.
Also known as Applying Social Network Analysis Techniques to Data Modeling and Data Architecture.
A graph is a non-linear data structure consisting of vertices (or nodes) connected by edges (or arcs); edges may be directed or undirected.
Graphs are a powerful and versatile data structure that lets you represent real-life relationships between different types of data (nodes).
A graph G consists of a non-empty set V, called the set of nodes (points, vertices) of the graph; a set E, the set of edges of the graph; and a mapping from each edge in E to a pair of elements of V.
Any two nodes that are connected by an edge in a graph are called "adjacent nodes".
In a graph G(V,E), an edge which is directed from one node to another is called a "directed edge", while an edge which has no specific direction is called an "undirected edge". A graph in which every edge is directed is called a "directed graph" or a "digraph". A graph in which every edge is undirected is called an "undirected graph".
If some edges are directed and some are undirected, the graph is called a "mixed graph".
Any graph which contains parallel edges (more than one edge between the same pair of nodes) is called a "multigraph".
If there is no more than one edge between any pair of nodes, the graph is called a "simple graph".
A graph in which weights are assigned to every edge is called a "weighted graph".
In a graph, a node which is not adjacent to any other node is called an "isolated node".
A graph containing only isolated nodes is called a "null graph". In a directed graph, for any node v, the number of edges which have v as their initial node is called the "outdegree" of v. The number of edges which have v as their terminal node is called the "indegree" of v. The sum of the outdegree and indegree of a node v is called its "total degree".
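The degree definitions above can be checked on a tiny directed graph. This is a minimal sketch in Python; the node names and edges are made up for illustration and do not come from the presentation.

```python
from collections import Counter

# Hypothetical directed graph: each pair is (initial node, terminal node).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = {"a", "b", "c", "d"}  # "d" has no edges, so it is an isolated node

# Count how often each node appears as the initial / terminal node of an edge.
outdegree = Counter(src for src, _ in edges)
indegree = Counter(dst for _, dst in edges)


def total_degree(v):
    """Sum of the outdegree and indegree of node v."""
    return outdegree[v] + indegree[v]


isolated = sorted(v for v in nodes if total_degree(v) == 0)

for v in sorted(nodes):
    print(v, outdegree[v], indegree[v], total_degree(v))
print("isolated:", isolated)
```

A `Counter` returns 0 for nodes it has never seen, which is why the isolated node needs no special handling.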
It includes:
Introduction to Graphs
Applications
Graph representation
Graph terminology
Graph operations
Adding vertex and edge in Adjacency matrix representation using C++ program
Adjacency List implementation in C++
Homework Problems
References
Students can learn about graph data structures. This PPT covers the following topics: graph representation, types of graphs, graph traversals such as DFS and BFS, topological sort, and applications of DFS and BFS.
This deck covers graphs as a data structure: applications of graphs, how to represent a graph, graph traversal, and the many terms related to graphs.
Graphs are popular for visualizing a problem. A matrix representation converts the graph into a form that a computer can use. This helps in finding an efficient solution and also provides many mathematical equations.
Social Network Analysis Introduction including Data Structure Graph overview. Doug Needham
Social Network Analysis Introduction including Data Structure Graph overview. Given in Cincinnati August 18th 2015 as part of the DataSeed Meetup group.
2. Introduction
@dougneedham
Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now Data
Scientist.
Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
I have a strong relational/traditional background.
Perpetual Student
Learning new things challenges our assumptions. It forces us to take a new perspective
on “old” problems. Eventually it may even show us that there is a better way to solve
a problem.
3. Introducing Data Structure Graphs
Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity Relationship
Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.
A DSG-L1 can show you which of your tables are going to have the most interesting query
performance.
Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an application.
Each Edge is a data transfer. Roughly equivalent to what we used to call Data Flow
diagrams.
A DSG-L2 can show you where the most work is going on in your
Enterprise.
Data Structure Graph Dependency (DSG-D) – Each vertex is a job, script, program, or
process that is dependent on something happening in sequence before it can do its
work.
A DSG-D can show you the sequence of events that need to take place in order for
something to be completed.
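One way to get "the sequence of events" out of a DSG-D is a topological sort. The sketch below uses Python's standard-library `graphlib` (Python 3.9+); the job names and dependencies are hypothetical, and the presentation does not say which tool or algorithm was actually used.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DSG-D: each key is a job, each value is the set of jobs
# that must finish before that job can start.
depends_on = {
    "load_staging": {"extract_orders", "extract_customers"},
    "build_dimensions": {"load_staging"},
    "build_facts": {"build_dimensions"},
    "refresh_reports": {"build_facts"},
}

# static_order() yields every job after all of its prerequisites.
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)
```

If the dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding about a job schedule.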
4. Definition
A Data Structure Graph is a group of atomic entities that are related to each other,
stored in a repository, then moved from one persistence layer to another, rendered as
a Graph.
A group of atomic entities.
Related to each other.
Stored in a repository.
Moved from one persistence layer to another.
Rendered as a Graph.
5. In summary: Social Network analysis applied
to data modeling.
Data modeling is a topic we are all familiar with here at Data Modeling Zone.
Social Network analysis is, perhaps, something new.
So, a little background on the topic we may not be familiar with.
6. What is Social Network Analysis?
“Social network analysis (SNA) is a strategy for investigating social structures through the use of network and
graph theories.
It characterizes networked structures in terms of nodes (individual actors, people, or things within the network)
and the ties or edges (relationships or interactions) that connect them.
Examples of social structures commonly visualized through social network analysis include
social media networks,
friendship and
acquaintance networks,
kinship,
disease transmission, and
sexual relationships.
These networks are often visualized through sociograms in which nodes are represented as points and ties are represented
as lines.” – Wikipedia
https://en.wikipedia.org/wiki/Social_network_analysis
7. Example From wiki:
"Kencf0618FacebookNetwork" by Kencf0618 -
Own work. Licensed under CC BY-SA 3.0 via
Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg
8. A little History
The 7 Bridges of Konigsberg
Every tome on Graph theory or Network analysis devotes a small portion of its time
to the 7 Bridges of Konigsberg.
If I don’t cover this with you, the gods of mathematics will strike me down, and never
allow me to do analysis again in the future.
10. The Problem
Folks enjoyed their Sunday afternoon strolls across the bridges, and occasionally people
would wonder whether a single walk could cross every bridge exactly once.
Eventually Leonhard Euler was brought into the debate about the problem.
Euler used Vertices to represent the land masses and edges (or arcs, at the time) to
represent bridges. He realized that the odd number of edges at each vertex made the problem
unsolvable.
Sarada Herke provides one of the best explanations of the solution in her video: Solution to
Konigsberg
Basically, the solution is that every vertex must have an even number of edges in order to make
it possible to start from one vertex and arrive back at the point of origin, crossing every
edge exactly once. Essentially, the number of bridges touching each land mass must be an even
number. (More details in the above video.)
And here is the cool thing about mathematicians: if we tell you something is impossible, we
have to tell you why in a way you can understand it. In explaining why, Euler also invented the
branch of mathematics we today call Graph Theory.
http://en.wikipedia.org/wiki/Leonhard_Euler
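The even-degree rule can be checked mechanically. This is a small sketch, assuming the usual four-land-mass, seven-bridge layout of the puzzle; the letters A to D are labels chosen here, not names from the slides.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# A is the central island; B, C, and D are the other three land masses.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# Degree of a vertex = number of bridge ends that touch it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = sorted(v for v, d in degree.items() if d % 2 == 1)

# In a connected graph, a closed walk using every edge exactly once
# exists only when no vertex has odd degree.
has_closed_walk = len(odd) == 0
# An open walk using every edge exactly once needs zero or two odd vertices.
has_open_walk = len(odd) in (0, 2)

print(dict(degree), odd, has_closed_walk, has_open_walk)
```

All four land masses come out with an odd degree, so neither kind of walk exists.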
11. A few terms
Stand back, we are going to talk about math!
Basically we are talking about a bunch of dots joined together by lines
Vertex – Dot on a graph
Edge – Line connecting two vertices
Edge_Label – this is a term I coined originally related to Data Structure Graphs that helps trace a
path. If you label your edges, and you have multiple edges with the same label in a Graph you can
quite easily identify walks, paths, and cycles through your graph.
A lot of things are networks if you look at them the right way.
Mark Newman has done a number of really cool presentations, available on YouTube, about Network
analysis.
https://www.youtube.com/watch?v=lETt7IcDWLI
12. More terms
What is a path?
Shortest path – How are two vertices connected?
Longest Path – Tracing the flow of an interesting item through a large collection of
applications.
Directed Graphs – or Digraphs
If you rearrange things how does the layout affect understanding?
This is not just data visualization, it can also be used for prediction.
https://www.youtube.com/watch?v=rwA-y-XwjuU
13. Final terms
Centrality – Hub and Authority
This is almost a whole topic by itself, since there are different types of Centrality:
Degree Centrality, Eigenvector Centrality, PageRank, etc…
Longest Path – Tracing the flow of an interesting item through a large collection of applications.
Power law.
What is a path?
Transitivity
Homophily – how things are similar
Directed Graphs – or Digraphs
Contagion – How do things “spread” through a network?
Let’s rearrange things, how does the layout affect understanding?
Order of a graph – number of vertices
Size of the graph – number of edges
This is not just data visualization, it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
14. The Math doesn’t change.
One thing I like about Graphs –
The Math does not change.
The math behind Graph theory can be a little intense, but it does not change
regardless of the scale of the graph.
Once you understand how to “do the math” on a small graph, that same math
applies to a Graph whether it is a graph of the people in this room, or a graph of the
people on this planet.
15. Before we get to the analysis we must collect data.
DBeaver can reverse engineer an ERD.
Point it at the source system, select a few options, then you have a diagram.
I wrote a small piece of Python code to translate the XML that DBeaver saves into a file suitable for import into Gephi.
One small caveat: the Foreign keys have to be defined for DBeaver to work. If the foreign keys are not
defined the output file will need to be modified.
Also, some aggregate or summary tables may not help your visualization.
This is subjective, so it is at the discretion of the person reviewing the diagram.
If you remove tables from the graph, please provide documentation such that the visualization can be
compared to the reality of your data model with no discrepancies.
The URL for DBeaver is here: https://dbeaver.jkiss.org/
(This section is a little hand-wavy, I know, but the tool or method for creating the file for import into
Gephi is largely irrelevant.)
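Whatever tool produces the foreign keys, the last step is writing an edge list. The sketch below is not the author's translation script: it skips the XML parsing entirely and starts from a hypothetical list of foreign keys. It assumes Gephi's spreadsheet import reads `Source`, `Target`, `Type`, and `Label` columns, so check the column names against your Gephi version.

```python
import csv
import io

# Hypothetical foreign keys: (child table, parent table, constraint name).
foreign_keys = [
    ("order_line", "orders", "fk_order_line_orders"),
    ("orders", "customer", "fk_orders_customer"),
]


def to_gephi_edge_csv(fks):
    """Return edge-list CSV text with Source, Target, Type, and Label columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    for child, parent, name in fks:
        # One directed edge per foreign key, pointing from child to parent.
        writer.writerow([child, parent, "Directed", name])
    return buffer.getvalue()


csv_text = to_gephi_edge_csv(foreign_keys)
print(csv_text)
```

Writing to a string keeps the sketch self-contained; in practice the same writer can be pointed at a file opened with `newline=""`.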
16. Gephi
http://gephi.github.io/
From the website: “Gephi is an interactive visualization and exploration platform for all
kinds of networks and complex systems, dynamic and hierarchical graphs.”
We are going to use data generated from my book: Data Structure Graphs.
These are inspired by my experience consulting, but do not represent an actual data
model, or ETL process.
The following slides are for a DSG Level 2 (ETL process).
32. Finally, here we are.
Within a Data Architecture there are lots of moving pieces. ETL, FTP, SFTP, Web-
Services, External data feeds. Data moving into Data Marts, and Data Warehouses.
Data Moving between applications.
Let’s imagine how to visualize this using the information we just gained.
33. Data Structure Graphs
Today, there are a few tools like ERWin, and SQL Developer that begin to organize
visualizations in this manner.
Very few of them allow you to perform analysis on the visualization.
As you find new tools that do this, please let me know.
I would love to evaluate those tools and see what interesting metrics can be arrived at
from new tools.
34. Dijkstra's algorithm
Some of you may have heard of Dijkstra’s algorithm.
It is a method for finding the shortest path between two nodes on a Graph.
This is a great optimization technique, but what if you need to find the longest path?
What “Edge_Label” has the most influence on my organization?
Iterate through each Edge_Label, create a subgraph that consists of only the nodes
this Edge_Label touches, then calculate the diameter of that subgraph.
The Edge_Label whose subgraph has the largest diameter has the most “impact” on your organization.
This is mostly applied to Data Structure Graph Level 2.
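The Edge_Label procedure can be sketched in a few lines of Python. The applications and labels below are hypothetical. The sketch also makes one assumption the slide leaves open: "diameter" is taken to be the longest shortest path between any two nodes that can reach each other while following edge direction.

```python
from collections import defaultdict, deque

# Hypothetical DSG-L2 edges: (source application, target application, Edge_Label).
edges = [
    ("crm", "staging", "customer"),
    ("staging", "warehouse", "customer"),
    ("warehouse", "mart", "customer"),
    ("billing", "staging", "invoice"),
    ("staging", "warehouse", "invoice"),
]


def longest_shortest_path(pairs):
    """Longest shortest-path length (in edges) over all reachable node pairs."""
    adj = defaultdict(list)
    for u, v in pairs:
        adj[u].append(v)
    nodes = {n for pair in pairs for n in pair}
    best = 0
    for start in nodes:
        # Breadth-first search gives shortest distances from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


# Build one subgraph per Edge_Label, then measure each subgraph.
by_label = defaultdict(list)
for u, v, label in edges:
    by_label[label].append((u, v))

diameters = {label: longest_shortest_path(pairs) for label, pairs in by_label.items()}
most_impact = max(diameters, key=diameters.get)
print(diameters, most_impact)
```

Here the "customer" label spans three hops and "invoice" spans two, so "customer" is reported as having the most impact.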
35. Now let’s answer some questions.
Which tables are “most important” to import when building a data warehouse?
The tables with the higher centrality measures.
For an operational system these will also be the tables that have the most queries written against them.
These will be your bottlenecks for any system.
Is this data model optimized for reading or writing?
What is the density of the data model?
A higher-density model is optimized for writing; a lower-density model is optimized for reading.
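Both measures are cheap to compute from a list of foreign keys. This is a minimal sketch using degree centrality (only one of the several centrality measures mentioned earlier) and the standard directed-graph density formula; the table names are hypothetical.

```python
# Hypothetical DSG-L1: (child table, parent table) for each foreign key.
foreign_keys = [
    ("order_line", "orders"),
    ("order_line", "product"),
    ("orders", "customer"),
    ("payment", "orders"),
]

tables = sorted({t for fk in foreign_keys for t in fk})
n = len(tables)

# Degree centrality: number of edges touching a table, divided by n - 1.
degree = {t: 0 for t in tables}
for child, parent in foreign_keys:
    degree[child] += 1
    degree[parent] += 1
centrality = {t: degree[t] / (n - 1) for t in tables}

# Density of a directed graph: edges present / edges possible.
density = len(foreign_keys) / (n * (n - 1))

most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central], density)
```

With five tables and four foreign keys, `orders` has the highest centrality (3 of 4 possible neighbours) and the density is 4 / 20.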
36. Barabasi-Albert model and Scale free networks.
Preferential attachment.
There are a few different models available for analysis and prediction of networks.
A Barabási-Albert model can be summarized as a “rich get richer” model. In other words, when new
nodes are added, they are more likely to connect to nodes that are already well
connected.
This sounds suspiciously similar to our data modeling concepts related to conformed dimensions.
My suspicion is that there are many data models that fit this model.
Please send me some anonymized data models. I want to research this more.
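Preferential attachment is easy to simulate. The sketch below is a simplified version of the Barabási-Albert process in which every new node adds exactly one edge; the function name, node count, and seed are choices made for this illustration.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per edge it touches, so a uniform
    # draw from the list is a degree-weighted draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges


edges = preferential_attachment(200)
degree = Counter(v for edge in edges for v in edge)
print(degree.most_common(3))
```

In runs like this a few early nodes typically collect far more edges than the rest, which is the "rich get richer" pattern to look for in a model with conformed dimensions.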
37. Some theoretical thoughts.
Let’s assume we have an equation for the growth of every table we have collected
from our little topological study above (more on this in a couple of slides).
Let us further assume we have a graph of the same tables.
Can you do anything interesting with this?
The derivative of each equation shows us the growth rate of the table.
What happens if we plug that derivative into the entropy equation for the graph?
What would this represent?
Could this be considered a valuation method?
A way to put a dollar value on a data model?
If you try it, let me know what you find out.
38. Apply the theory.
Using a few metrics from each table we can do some clustering.
Take the number of columns of a table, the centrality measure, and the growth rate,
and you have a vector for each table.
Doing some simple cosine similarity on these vectors will tell you mathematically which
tables are similar.
Is this finding consistent with expectations?
If not should the model be adjusted?
What does this result say to you?
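Cosine similarity on these three-element vectors needs nothing beyond the standard library. The table names and metric values below are invented for illustration.

```python
import math

# Hypothetical metrics per table: (number of columns, centrality, growth rate).
tables = {
    "orders": (12, 0.75, 500.0),
    "order_line": (8, 0.50, 1500.0),
    "customer": (20, 0.25, 20.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare every pair of tables once.
names = sorted(tables)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(tables[first], tables[second])
        print(first, second, round(score, 4))
```

One design point to consider: the three metrics live on very different scales, so the raw growth rate dominates the result. Rescaling each metric to a common range before comparing is a reasonable step, though the slides do not say whether this was done.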
39. Deriving the growth rate of each table.
Little R demonstration to follow.
A design methodology like the data vault mandates that every table have timestamps recording when the
data is loaded.
Collect how many records are loaded per day.
A calculation that represents the growth formula for each table can be derived with R.
Using the growth rate, centrality, and the width of a table (number of columns) you can do cosine similarity
to determine the tables that are mathematically similar to each other.
Using this information you may be able to reallocate the infrastructure that the data warehouse sits on.
Is every table stored on the same disk storage media? Does it need to be?
How about caching? Using these metrics alone you can make a well informed decision about your storage
platform.
The following image is a small topological representation of this process.
This is still slightly theoretical, and I welcome having a conversation with anyone that may want to know
more.
Again, send me anonymized data. Hopefully along with the Data Structure Graph you generated from your
data.
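The talk derives the growth formula in R. As a rough stand-in, the Python sketch below fits a straight line to a table's cumulative size by ordinary least squares, so the growth rate (the derivative of the fitted line) is simply the slope. The daily load counts are hypothetical, and a linear model is an assumption; a real table may need a different curve.

```python
# Hypothetical daily load counts for one table (records loaded per day).
daily_loads = [100, 120, 110, 130, 140]

# Cumulative size of the table after each day.
sizes = []
total = 0
for loaded in daily_loads:
    total += loaded
    sizes.append(total)


def linear_fit(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


growth_rate, intercept = linear_fit(sizes)
print(growth_rate)  # records per day under the linear assumption
```

The resulting slope is the value that would go into the per-table vector alongside centrality and column count.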
41. Consider the following:
If you need assistance, contact me directly (I am easy to find @dougneedham)
Network/Graph Analysis is cool.
It can show you some interesting things about your data that you may not have
considered.
42. What did I leave out?
Graphs that change over time – What happens when you remove a single Edge or
Vertex?
Comparing two networks – If you have the same number of edges and nodes, are two
graphs the same?
Contagion – How will data spread through the network? (Since a DSG represents
different types of Edges based on Edge_Label, Contagion should not affect the entire
network.) This is also commonly known as data lineage. If you don’t have a tool that
does it, with a bit of metadata management this can be derived from a Data Structure
Graph Level 2.
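Deriving lineage from a DSG-L2 amounts to a traversal that follows only edges carrying one Edge_Label. This is a minimal sketch with hypothetical applications and labels; the function name `downstream` is introduced here for illustration.

```python
from collections import defaultdict, deque

# Hypothetical DSG-L2 edges: (source application, target application, Edge_Label).
edges = [
    ("crm", "staging", "customer"),
    ("staging", "warehouse", "customer"),
    ("billing", "staging", "invoice"),
    ("staging", "ledger", "invoice"),
]


def downstream(start, label):
    """Applications reachable from start by following only edges with label."""
    adj = defaultdict(list)
    for u, v, edge_label in edges:
        if edge_label == label:
            adj[u].append(v)
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return sorted(seen - {start})


print(downstream("crm", "customer"))     # ['staging', 'warehouse']
print(downstream("billing", "invoice"))  # ['ledger', 'staging']
```

Customer data never reaches `ledger` even though both flows pass through `staging`, which illustrates why contagion along one Edge_Label does not affect the entire network.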
43. Other Analysis
What else can be done with Social Network Analysis?
How about risk exposure to banks?
http://www.federalreserve.gov/newsevents/speech/yellen20130104a.htm
45. One other cool bit of Math
How many reports can your dimensional data model support?
Do you have the situation where people want to create a project out of a report,
rather than do a proper data model design up front?
Here is some help.
The upper bound of the total number of reports that a conformed dimension data
model can support is calculated by:
Calculate the number of selectable columns in each dimension ($2^c - 1$)
Create the adjacency matrix for the dimensions to facts
A bit of multiplication.
More details here: http://bit.ly/MeasuringDimensionalModels
46. Graphs are Cool!
Help me.
Please send me anonymized data.
In order to present more about how the mathematics of Graph theory and social
network analysis can be applied to data modeling in general, I need
more data.
This is a fascinating topic. If you want to reach out to me directly, I can be reached at:
dougthedataguy@gmail.com
Here is my GitHub for the code and data from the book, and examples:
http://bit.ly/DataStructureGraph_github
This is an overview of what I call Data Structure Graphs given at the Data Modeling Zone conference in October of 2017.
Let me introduce myself.
I have an incredibly traditional background. I have certifications in Oracle and SQL Server, I started my career as a mainframe DBA.