
Social Network Analysis Introduction including Data Structure Graph overview.


Social Network Analysis Introduction including Data Structure Graph overview. Given in Cincinnati August 18th 2015 as part of the DataSeed Meetup group.

Published in: Data & Analytics

  1. 1. Social Network Analysis An overview Presentation by @dougneedham
  2. 2. Introduction  @dougneedham  Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now Data Scientist.  Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.  I have a strong relational/traditional background.  Perpetual Student  Learning new things challenges our assumptions. Forces us to take a new perspective on “old” problems. Eventually maybe even shows us that there is a better way to solve a problem.
  3. 3. Why study social networks?  It is cool.  The concepts around Social Network Analysis can be applied to many interesting problems in a variety of business verticals.  The foundation of Social Network Analysis is Graph theory.  Solving Crime  Some examples: Introduction to Graph_Theory
  4. 4. What is Social Network Analysis?  “Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission, and sexual relationships. These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines.” – Wikipedia 
  5. 5. Example From wiki: "Kencf0618FacebookNetwork" by Kencf0618 - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg
  6. 6. A little History  The 7 Bridges of Königsberg  Every tome on Graph theory or Network analysis devotes a small portion of its pages to the 7 Bridges of Königsberg.  If I don’t cover this with you, the gods of mathematics will strike me down, and never allow me to do analysis again in the future.
  7. 7. The Bridges
  8. 8. The Problem  Folks enjoyed their Sunday afternoon strolls across the bridges, and occasionally people would wonder whether a route existed that crossed every bridge exactly once.  Eventually Leonhard Euler was brought into the debate.  Euler used Vertices to represent the land masses and Edges (or arcs, at the time) to represent bridges. He realized that because every vertex has an odd number of edges, the problem is unsolvable: such a walk can have at most two vertices of odd degree, its start and its end.  Sarada Herke provides one of the best explanations of the solution: Solution to Königsberg  And here is the cool thing about mathematicians. If we tell you something is impossible, we have to tell you why in a way you can understand it. In doing so, Euler also invented the branch of mathematics we today call Graph Theory.
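
Euler's degree argument is small enough to check by hand, but here is a minimal Python sketch of it. The land-mass names A to D are labels I made up for the sketch, and the check assumes the graph is connected (which the bridges are).

```python
from collections import Counter

# The seven bridges as the edge list of a multigraph.
# A = the island, B and C = the two river banks, D = the eastern land mass.
# These letters are labels chosen for this sketch only.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]


def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def has_euler_walk(edges):
    """For a connected multigraph, a walk that uses every edge exactly
    once exists only when zero or two vertices have odd degree."""
    odd = [v for v, d in degrees(edges).items() if d % 2 == 1]
    return len(odd) in (0, 2)


print(dict(degrees(BRIDGES)))   # every land mass has an odd degree
print(has_euler_walk(BRIDGES))  # so no such walk exists
```

All four land masses come out with an odd degree, which is exactly why the stroll is impossible.
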
  9. 9. Why analyze Facebook data?  Facebook is something that most people use.  It is easy to see the relationships, and the concepts of the Graph/Network are intuitive to people who are looking at their “own” network.  The main idea is: if you can understand your own friend data, you can learn the concepts quickly, then apply these same concepts to more complicated problems.  We will talk a little about some complicated topics at the end.
  10. 10. A few terms  Stand back, we are going to talk about math!  Basically we are talking about a bunch of dots joined together by lines  Vertex – Dot on a graph  Edge – Line connecting two dots  Edge_Label – this is a term I coined, originally related to Data Structure Graphs, that helps trace a path. If you label your edges, and you have multiple edges with the same label in a Graph, you can quite easily identify walks, paths, and cycles through your graph.  Triangle – 3 Vertices, 3 Edges  Square – 4 Vertices, 4 Edges  Open Triangle – 3 Vertices, 2 Edges  A lot of things are networks if you look at them the right way.  Mark Newman has given a number of excellent presentations on Network analysis, available on YouTube.
  11. 11. More terms  Transitivity – The friend of my friend is my friend. Really?  Homophily – the tendency of similar things to be connected  Directed Graphs – or Digraphs, where each edge has a direction  Contagion – How do things “spread” through a network?  Let’s rearrange things: how does the layout affect understanding?  Order of a graph – number of vertices  Size of a graph – number of edges  This is not just data visualization; it can also be used for prediction.
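
To make the terms concrete, here is a small Python sketch that computes the order, the size, and the transitivity of a toy friendship graph. The four names and their friendships are invented for illustration, and transitivity is taken here as the share of open-or-closed triangles that are closed, which is one common definition.

```python
from itertools import combinations

# Hypothetical friendships: one triangle (Ann, Bob, Cat) plus Dan hanging off Cat.
EDGES = [("Ann", "Bob"), ("Bob", "Cat"), ("Ann", "Cat"), ("Cat", "Dan")]


def adjacency(edges):
    """Build an undirected graph as a dict of neighbor sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def order(adj):
    """Order of a graph: the number of vertices."""
    return len(adj)


def size(adj):
    """Size of a graph: the number of edges (each edge is stored twice)."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


def transitivity(adj):
    """Fraction of connected triples (two edges sharing a center vertex)
    whose two outer vertices are also joined, closing the triangle."""
    triples = 0
    closed = 0
    for neighbors in adj.values():
        for a, b in combinations(sorted(neighbors), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


graph = adjacency(EDGES)
print(order(graph), size(graph), transitivity(graph))
```

A transitivity of 1.0 would mean every friend of a friend really is a friend; in this toy graph Dan keeps it below that.
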
  12. 12. Final terms  Centrality – Hub and Authority  This is almost a whole topic by itself, since there are different types of Centrality:  Degree Centrality – Simple: the Vertex with the highest degree (the most edges) is the most central.  Eigenvector Centrality – How important a particular Vertex is to a given network, where a Vertex counts for more when its neighbors are themselves important.  PageRank – similar to Eigenvector Centrality, only scaled: if a given vertex is closely connected to a vertex with a very high PageRank, it is itself given a high PageRank.  Serious nutshell definitions.  Shortest path – How are two vertices connected?  Longest Path – Tracing the flow of an interesting item through a large collection of applications.
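
Gephi computes these metrics for you, but a rough Python sketch may help show what is going on underneath. This is not Gephi's implementation: the PageRank below is a plain power iteration with an assumed damping factor of 0.85, and both toy graphs are made up for the example.

```python
def degree_centrality(adj):
    """Degree of each vertex divided by the maximum possible degree."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph.

    out_links maps every vertex to the list of vertices it points at;
    every target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every vertex starts each round with the "random jump" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A vertex with no outgoing edges spreads its rank evenly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical undirected friendships, stored as neighbor sets.
FRIENDS = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
# Hypothetical directed links: A, B and D all point at C; C points back at A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

centrality = degree_centrality(FRIENDS)
ranks = pagerank(LINKS)
print(centrality)
print(max(ranks, key=ranks.get))
```

Note how A scores well in the second graph even though only one vertex points at it: that one vertex is C, which has the highest PageRank.
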
  13. 13. Why is a path important? More on this later… The Original Joke This is me in different stores
  14. 14. The Math doesn’t change.  One thing I like about Graphs –  The Math does not change.  The math behind Graph theory can be a little intense, but it does not change regardless of the scale of the graph.  Once you understand how to “do the math” on a small graph, that same math applies to any Graph, whether it is a graph of the people in this room or a graph of the people on this planet.  Now, let me introduce you to a tool that does much of the Mathematics for you…
  15. 15. But first, Netvizz…  Netvizz is a tool that extracts data from different sections of the Facebook Platform.  It provides an interface to the Facebook Graph API.  For the version of data we will be looking at, I was able to extract friendship connections. Facebook has since changed their permissions such that you can no longer extract this information.  However, there are some other interesting things you can do with Netvizz.  If you manage a Facebook Group, this might be interesting.  For this particular talk we are going to focus on Gephi interpretation. If we want to have a more in-depth talk on Facebook and the Graph API that Facebook has opened, we can discuss that at another time.  To get this yourself, go into Facebook and search for: Netvizz. (You have to authorize it. You can de-authorize it later.)  You will have a number of options: group data, page data, page like network, search, and link stats.  Click “group data”.  Select a group; if you need a sample ID, use: 39462256584  It runs for a bit, then dumps to a zip file.  Save the file, then extract it.  Open Gephi, and use Gephi to import your GDF file.
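
If you are curious what is inside the GDF file before Gephi reads it, here is a hedged Python sketch of a reader. It assumes the usual GDF layout: a `nodedef>` header line followed by node rows, then an `edgedef>` header line followed by edge rows, all comma separated. The sample text is invented, and a real Netvizz export will carry more columns than shown here.

```python
import csv
import io

# Invented sample in the assumed GDF layout; not real Netvizz output.
SAMPLE_GDF = """nodedef>name VARCHAR,label VARCHAR
1,Alice
2,Bob
3,Carol
edgedef>node1 VARCHAR,node2 VARCHAR
1,2
2,3
"""


def parse_gdf(text):
    """Split GDF text into a list of node dicts and a list of edge dicts."""
    nodes, edges = [], []
    columns, target = [], None
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        first = row[0]
        if first.startswith(("nodedef>", "edgedef>")):
            # A header line switches sections and names the columns.
            target = nodes if first.startswith("nodedef>") else edges
            row[0] = first.split(">", 1)[1]
            # Each header cell looks like "name VARCHAR"; keep the name only.
            columns = [cell.split()[0] for cell in row]
            continue
        if target is not None:
            target.append(dict(zip(columns, row)))
    return nodes, edges


gdf_nodes, gdf_edges = parse_gdf(SAMPLE_GDF)
print(len(gdf_nodes), "nodes,", len(gdf_edges), "edges")
```
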
  16. 16. Gephi From the website: “Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.” Java 1.7 is required; you may have to set this in Gephi.conf. Depending on the size of the network you are studying, you may also need to increase the memory available to Java in Gephi.conf.
  17. 17. Gephi Startup
  18. 18. Gephi – Open GML file
  19. 19. Gephi – After opening
  20. 20. Layout
  21. 21. Behavior Options
  22. 22. After running
  23. 23. Partitioning
  24. 24. Metrics  Remember all those numbers we spoke about?  Here are many of them.
  25. 25. Data Table
  26. 26. Configure Labels
  27. 27. Here is the layout with the labels as number of connections
  28. 28. Add Background
  29. 29. Visualization File->Export-> SVG/PDF/PNG…
  30. 30. Export to Excel
  31. 31. How do we use this?  Finding bottlenecks.  You have to ignore, for a moment, the fact that everyone on this graph is connected to you.  How would someone get a message to another given person?  They would have to pass it either to someone they both know, or to someone who is more likely to be connected to the target of the message.  This was the heart of Milgram’s experiment that gave us the concept of six degrees of separation.
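
The message-passing question is a shortest-path question. As a sketch, assuming an unweighted friendship graph with invented names, a breadth-first search finds the shortest chain of acquaintances between two people:

```python
from collections import deque

# Hypothetical friendship graph; Eve knows nobody in it.
FRIENDS = {
    "Ann": {"Bob"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Bob", "Dan"},
    "Dan": {"Cat"},
    "Eve": set(),
}


def shortest_path(adj, start, goal):
    """Breadth-first search: return the shortest chain of people from
    start to goal, or None when no chain exists."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the "previous" links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in sorted(adj.get(current, ())):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


print(shortest_path(FRIENDS, "Ann", "Dan"))
print(shortest_path(FRIENDS, "Ann", "Eve"))
```

In this toy graph Bob and Cat are the bottlenecks: remove either one and Ann can no longer reach Dan.
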
  32. 32. Other Analysis  What else can be done with Social Network Analysis?  How about risk exposure to banks? 
  33. 33. Application to Business Intelligence  What if the Vertices are not people?  What if the Edges are not mutual connections?  Jonathan and others over the past few meetings have done a great job at explaining the underpinnings of how a particular BI framework is put together.  Within a Data Architecture there are lots of moving pieces: ETL, FTP, SFTP, Web-Services, external data feeds. Data moving into Data Marts and Data Warehouses. Data moving between applications.  Let’s imagine how to visualize this using the information we just gained.
  34. 34. Data Structure Graph  A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph.  A group of atomic entities.  Related to each other.  Stored in a repository.  Moved from one persistence layer to another.  Rendered as a Graph.
  35. 35. Introducing Data Structure Graphs  Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity Relationship Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.  Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an application. Each Edge is a data transfer. Roughly equivalent to what we used to call Data Flow diagrams.  Data Structure Graph Dependency (DSG-D) – Each Vertex is a job, script, program, or process that is dependent on something happening in sequence before it can do its work.  A DSG-L1 can show you where you are going to have the most interesting query performance of your tables.  A DSG-L2 can show you where the most work is going on in your Enterprise.  A DSG-D can show you the sequence of events that need to take place in order for something to be completed.
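
A DSG-D is a dependency graph, so "the sequence of events that need to take place" is a topological ordering of it. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The job names are hypothetical, and this is one way to read a DSG-D, not the method from the book.

```python
from graphlib import TopologicalSorter

# Hypothetical DSG-D: each job maps to the set of jobs that must finish first.
JOBS = {
    "load_warehouse": {"extract_orders", "extract_customers"},
    "build_mart": {"load_warehouse"},
    "refresh_dashboard": {"build_mart"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every dependency.

    graphlib raises CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


job_order = run_order(JOBS)
print(job_order)
```

The two extract jobs can come out in either order, since neither depends on the other; everything downstream is fixed.
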
  36. 36. New Project, Data Table, Import data.
  37. 37. Load as “Edges Table” Source, Target (required)
  38. 38. Choose Create Missing Nodes
  39. 39. After a few calculations and layout runs
  40. 40. PageRank – Which application is most important?
  41. 41. A few more tweaks
  42. 42. Where is that Node with the highest PageRank?
  43. 43. Remember paths? The Original Joke This is me in different stores
  44. 44. Dijkstra's algorithm  Some of you may have heard of Dijkstra’s algorithm.  It is a method for finding the shortest path between two nodes on a Graph.  This is a great optimization technique, but what if you need to find the longest path?  What Edge_Label has the most influence on my organization?  Iterate through each Edge_Label, create a subgraph that consists of only the nodes this Edge_Label touches, then calculate the diameter of that Graph (the longest of its shortest paths).  The data point represented by the Edge_Label that has the longest path has the most “value” to your organization.
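
Here is a rough Python sketch of that Edge_Label loop. The application names and labels are invented, and two simplifications are assumptions of mine: edges are treated as undirected, and a disconnected subgraph is scored by the longest distance found between vertices that can reach each other.

```python
from collections import defaultdict, deque

# Hypothetical data transfers: (Source, Target, Edge_Label).
TRANSFERS = [
    ("CRM", "Staging", "customer"),
    ("Staging", "Warehouse", "customer"),
    ("Warehouse", "Mart", "customer"),
    ("Mart", "Dashboard", "customer"),
    ("ERP", "Staging", "invoice"),
    ("Staging", "Warehouse", "invoice"),
]


def subgraphs_by_label(rows):
    """Build one undirected subgraph (dict of neighbor sets) per Edge_Label."""
    graphs = defaultdict(lambda: defaultdict(set))
    for source, target, label in rows:
        graphs[label][source].add(target)
        graphs[label][target].add(source)
    return graphs


def eccentricity(adj, start):
    """Longest shortest-path distance from start, found by breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adj[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return max(distance.values())


def diameter(adj):
    """Diameter: the largest eccentricity over all vertices."""
    return max(eccentricity(adj, vertex) for vertex in adj)


def label_diameters(rows):
    """Map each Edge_Label to the diameter of its subgraph."""
    return {label: diameter(adj) for label, adj in subgraphs_by_label(rows).items()}


diameters = label_diameters(TRANSFERS)
print(diameters)
print(max(diameters, key=diameters.get))
```

In this made-up environment the customer data travels furthest, so by the reasoning on the slide it is the entity with the most "value".
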
  45. 45. Hard to see, I know, but the top diagram is the “master graph”, the bottom image is a single Edge_Label. You can see how an individual data entity flows through an organization.
  46. 46. My book goes through a number of examples of doing a Graph analysis of a fictional organization.
  47. 47. Consider the following:  If you need assistance, send a message to the group, or contact me directly (I am easy to find @dougneedham)  Network/Graph Analysis is cool.  It can show you some interesting things about your data that you may not have considered.  Due thought should be put towards a network analysis project.  Organizing the data requires a bit of thought. (From -> To vertices is just a start).  Directed graph, undirected, bigraph? Setup work needs to be done.  Tools help with the detailed calculations, and show the paths, walks, etc.
  48. 48. What did I leave out?  Graphs that change over time – What happens when you remove a single Edge or Vertex?  Growth of a Network – Erdős-Rényi versus Barabási-Albert models (Random versus Preferential Attachment)  Scale Free networks – Graphs that conform to Power laws. (These are intrinsically Social Networks, but I didn’t give much detail)  Comparing two networks – If you have the same number of edges and nodes, are two graphs the same? Is one graph isomorphic to another?  Contagion – Ceteris paribus, how will things (information, viruses, data, disease…) spread through the network? (Since a DSG represents different types of Edges based on Edge_Label, Contagion should not affect this type of network entirely.)  Large Graphs – GraphX, a part of Apache Spark, is best used for this purpose.  The Strength of Weak Ties Paradox  Social Capital
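
For anyone who wants to play with the two growth models named above, here is a simplified Python sketch of both. The parameters (200 vertices, edge probability 0.02, 2 edges per new vertex) and the seed are arbitrary choices for the example, and the preferential attachment version is a bare-bones variant that starts from a small clique.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, p, rng):
    """Random model: every possible edge appears independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def barabasi_albert(n, m, rng):
    """Preferential attachment: each new vertex links to m existing vertices,
    chosen with probability proportional to their current degree."""
    # Start from a small clique of m + 1 vertices.
    edges = list(combinations(range(m + 1), 2))
    # Each vertex appears here once per edge end, so a uniform pick
    # from this list is a degree-proportional pick of a vertex.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in sorted(chosen):
            edges.append((old, new))
            endpoints.extend((old, new))
    return edges


def max_degree(edges):
    """Degree of the best-connected vertex."""
    counts = Counter(v for edge in edges for v in edge)
    return max(counts.values())


er_edges = erdos_renyi(200, 0.02, random.Random(42))
ba_edges = barabasi_albert(200, 2, random.Random(42))
print("random:", len(er_edges), "edges, max degree", max_degree(er_edges))
print("preferential:", len(ba_edges), "edges, max degree", max_degree(ba_edges))
```

Load both edge lists into Gephi and compare the degree distributions: preferential attachment tends to grow a few large hubs, which the random model typically does not.
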
  49. 49. Finally… Want to do Data Science?  Challenge for members of the audience.  1. Download Gephi.  2. Put together a simple CSV: Source, Target, Edge_Label that describes your own data environment.  3. Load it in Gephi and have Gephi run the metrics, and perform the auto layout.  4. Answer this question: Did you get what you expected?  5. Get a colleague to do the same thing, compare the images. How similar are they?  Here is my hypothesis: If you have more than 5 data applications, including Hadoop and Data Warehouse infrastructure, your Graph will follow the rules of preferential attachment. (To<->From ETL tools don’t count in the analysis)  Tweet me @dougneedham #DataStructureGraph (anonymized, of course.)  What does your Graph look like?
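
Step 2 of the challenge can be done in a spreadsheet, but here is a small Python sketch that produces the same CSV. The applications and labels are placeholders to replace with your own environment; the header names Source, Target, and Edge_Label follow the slides.

```python
import csv
import io

# Placeholder rows: swap in the applications and data entities you actually run.
ROWS = [
    ("CRM", "Staging", "customer"),
    ("ERP", "Staging", "invoice"),
    ("Staging", "Warehouse", "customer"),
    ("Staging", "Warehouse", "invoice"),
    ("Warehouse", "Dashboard", "customer"),
]


def to_edge_csv(rows):
    """Render the rows as CSV text with the header the slides call for."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Edge_Label"])
    writer.writerows(rows)
    return buffer.getvalue()


edge_csv = to_edge_csv(ROWS)
print(edge_csv)
```

Save the printed text to a file such as `edges.csv` (the file name is up to you) and load it in Gephi as an “Edges Table”.
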
  50. 50. Final Thoughts – Questions?