Social network analysis

Data Software Engineer
Jul. 26, 2013

  1. SOCIAL NETWORK ANALYSIS Caleb Jones { “email” : “calebjones@gmail.com”, “website” : “http://calebjones.info”, “twitter” : “@JonesWCaleb” }
  2. Overview •  Network Analysis – Crash Course •  Degree •  Components •  Modularity •  Ranking •  Resiliency •  Gephi – Intro •  Loading data (Facebook) •  Navigation •  Statistics •  Exporting •  Filtering •  Resiliency
  3. Resources •  SNA Coursera Course (next being taught October 2013) •  Linked, by Albert-László Barabási
  4. Network Analysis – Crash Course •  Degree (n): The number of connections a node has. •  Node A has in-degree 3 and out-degree 1 •  Node B has degree 4 (Figure: example graph with nodes A and B.)
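
The in/out-degree bookkeeping can be sketched in a few lines of Python. The edge list below is hypothetical (the slide's figure is not part of this transcript); it was chosen only so that node A ends up with in-degree 3 and out-degree 1, as stated on the slide.

```python
from collections import Counter

# Hypothetical directed edges (source, target); NOT the slide's actual figure.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]

# Out-degree counts edges leaving a node, in-degree counts edges arriving at it.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)


def degree(node):
    """Total degree: incoming plus outgoing connections."""
    return in_degree[node] + out_degree[node]


print("A:", in_degree["A"], "in,", out_degree["A"], "out,", degree("A"), "total")
```

In an undirected graph the in/out distinction disappears and only the total is reported.
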
  5. Network Analysis – Crash Course •  Component (n): A maximal connected subgraph of an undirected graph. •  The giant component is the largest component. (Figure: graph with nodes { A, B, C, X, Y, Z }, with one component labeled as the giant component.)
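
One way to find components is a breadth-first search from every node not yet visited. This is a minimal sketch; the edges over the slide's node set { A, B, C, X, Y, Z } are made up for illustration, since the figure is not in the transcript.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the components of an undirected graph as sets, largest first."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


# Hypothetical edges; the slide's actual edges are not in the transcript.
nodes = ["A", "B", "C", "X", "Y", "Z"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "X"), ("Y", "Z")]
components = connected_components(nodes, edges)
giant = components[0]  # the giant component is simply the largest one
print(len(components), "components; giant:", sorted(giant))
```
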
  6. Network Analysis – Crash Course •  Modularity (n) ~ Division of a graph into communities (modules/classes/cliques) that are densely interconnected internally, with relatively sparse interconnection between communities. (Figure: graph with nodes { A, B, C, X, Y, Z } split into Community 1 and Community 2.)
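
To make "dense inside, sparse between" concrete, here is a sketch that scores a given partition using one common definition, Newman's modularity: for each community, the fraction of edges inside it minus the fraction expected if edges were placed at random with the same degrees. The two-triangle graph and its community labels are illustrative assumptions, and this only scores a partition; it does not search for the best one.

```python
def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    intra = {}       # community -> number of edges fully inside it
    degree_sum = {}  # community -> sum of degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single bridge edge C-X.
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("X", "Y"), ("Y", "Z"), ("Z", "X"),
         ("C", "X")]
two_communities = {"A": 1, "B": 1, "C": 1, "X": 2, "Y": 2, "Z": 2}
one_community = {n: 1 for n in two_communities}

q_split = modularity(edges, two_communities)
q_merged = modularity(edges, one_community)
print(q_split, q_merged)
```

Putting every node in a single community scores zero, so a positive score for the two-triangle split indicates real community structure.
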
  7. Network Analysis – Crash Course • Ranking: A measure of a node’s “importance” • Many different methods for determining “importance” • Degree, Centrality, Closeness, Betweenness, Eigenvector, HITS, PageRank, Erdös Number • Which one to consider depends on the question being asked • Precursor to identifying network resilience, diffusion, and vulnerability
  8. Network Analysis – Crash Course • Degree ranking: Quantity over quality • Node scores: A = 3, B = 3, C = 1, D = 1, X = 1, Y = 1, Z = 3, Q = 1
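
A sketch of the degree ranking. The slide's figure is missing from the transcript, so the edge list here is an assumption: a small tree that reproduces the degree scores listed on the slide.

```python
from collections import Counter

# Assumed edge list (inferred from the scores, not taken from the slide's figure).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]

degree = Counter()
for u, v in EDGES:
    degree[u] += 1
    degree[v] += 1

# Highest degree first; ties are broken alphabetically.
ranking = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
for node, score in ranking:
    print(node, score)
```
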
  9. Network Analysis – Crash Course • Betweenness Ranking: How frequently a node appears on shortest paths between other nodes. • Node scores: A = 15, B = 11, C = 0, D = 0, X = 0, Y = 0, Z = 11, Q = 0
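
Betweenness can be computed with Brandes' algorithm: run a breadth-first search from each node, count shortest paths, then walk back accumulating each node's share. The sketch below is unnormalized and assumes an undirected, unweighted graph; the edge list is an inferred stand-in for the slide's missing figure.

```python
from collections import deque

# Assumed edge list (inferred from the slide's scores, not from its figure).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness via Brandes' algorithm."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing path shares to predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: total / 2 for v, total in score.items()}


scores = betweenness(build_adj(EDGES))
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```
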
  10. Network Analysis – Crash Course • Closeness Ranking: Average number of hops from a node to the rest of the network. • Node scores: A = 1.571, B = 1.857, C = 2.714, D = 2.714, X = 2.714, Y = 2.714, Z = 1.857, Q = 2.429 • Note: Smaller is (usually) better
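
The closeness score as defined on this slide is the mean hop count from a node to every other node. A minimal sketch, assuming a connected graph and the same inferred edge list as a stand-in for the missing figure:

```python
from collections import deque

# Assumed edge list (inferred from the slide's scores, not from its figure).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def hops_from(adj, source):
    """Breadth-first search: hop count from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_hops(adj):
    """Mean distance from each node to all others (assumes a connected graph)."""
    n = len(adj)
    return {v: sum(hops_from(adj, v).values()) / (n - 1) for v in adj}


closeness = average_hops(build_adj(EDGES))
for node in sorted(closeness, key=closeness.get):
    print(node, round(closeness[node], 3))
```

Some tools report the reciprocal of this average instead, in which case larger is better.
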
  11. Network Analysis – Crash Course • Eigenvector Ranking: A node’s “influence” on the network (accounts for who you know) • Node scores: A = 1, B = 0.836, C = 0.392, D = 0.392, X = 0.392, Y = 0.392, Z = 0.836, Q = 0.465 • Google’s PageRank is a variant of this • Based on the leading eigenvector of the adjacency matrix
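
Eigenvector scores can be approximated by power iteration: repeatedly replace each node's score with the sum of its neighbors' scores and rescale. The sketch below adds each node's own score at every step (an identity shift), which is an implementation choice made here to avoid oscillation on bipartite graphs such as trees; it leaves the leading eigenvector unchanged. The edge list is an inferred stand-in for the missing figure, and the converged values land close to, but not exactly on, the slide's figures, which may reflect different iteration settings.

```python
# Assumed edge list (inferred from the slide's scores, not from its figure).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I), rescaled so the top node scores 1."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # x <- (A + I) x: own score plus the sum of the neighbors' scores.
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        top = max(nxt.values())
        x = {v: val / top for v, val in nxt.items()}
    return x


centrality = eigenvector_centrality(build_adj(EDGES))
for node in sorted(centrality, key=centrality.get, reverse=True):
    print(node, round(centrality[node], 3))
```
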
  12. Network Analysis – Crash Course • Erdös Ranking: Number of hops to a specific node (degrees of separation). • Node scores (with node A as “Erdös”): A = 0, B = 1, C = 2, D = 2, X = 2, Y = 2, Z = 1, Q = 1 • Note: Smaller is (usually) better • What if “Erdös” is an influential CEO? • What if “Erdös” has bird flu?
  13. Network Analysis – Crash Course • Erdös Ranking: Number of hops to a specific node (degrees of separation). • Node scores (with node D as “Erdös”): A = 2, B = 1, C = 2, D = 0, X = 4, Y = 4, Z = 3, Q = 3 • Note: Smaller is (usually) better • What if “Erdös” is an influential CEO? • What if “Erdös” has bird flu?
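
The Erdös number is a breadth-first search from the chosen node. This sketch reuses an inferred edge list (an assumption standing in for the missing figure) and runs the search from node A and then from node D, the two nodes that score 0 on the two slides above.

```python
from collections import deque

# Assumed edge list (inferred from the slide's scores, not from its figure).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def erdos_numbers(adj, source):
    """Hops from `source` to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


adj = build_adj(EDGES)
from_a = erdos_numbers(adj, "A")
from_d = erdos_numbers(adj, "D")
print("max separation from A:", max(from_a.values()))
print("max separation from D:", max(from_d.values()))
```

Choosing a peripheral node such as D as the source stretches the maximum separation, which is the contrast the two slides draw.
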
  14. Network Analysis – Crash Course • Limitations: • We only considered undirected networks (directed is more complicated) • We treated all edges as equal. Many networks have a weight or cost associated with edges (e.g. distance) • We treated all nodes as equal. A node’s importance may be inherent, based on attributes separate from its position in the network (e.g. dating sites)
  15. Network Analysis – Crash Course • Resiliency (removing nodes/links): • Target nodes based on their “importance” • High-degree nodes are more likely to affect local communities • High-betweenness/eigenvector nodes are more likely to fragment communities
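
A small experiment illustrates the idea: remove one node and look at the sizes of the pieces that remain. The edge list is an assumed stand-in for the example graph used on the ranking slides (its figure is not in the transcript).

```python
# Assumed edge list (inferred from the ranking slides' scores).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def component_sizes(nodes, edges):
    """Sizes of the connected components, largest first."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sizes, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def sizes_after_removal(edges, removed):
    """Drop one node and its edges, then measure the remaining fragments."""
    nodes = sorted({n for e in edges for n in e} - {removed})
    kept = [(u, v) for u, v in edges if removed not in (u, v)]
    return component_sizes(nodes, kept)


for target in ("A", "B", "C"):
    print("remove", target, "->", sizes_after_removal(EDGES, target))
```

In this toy graph A and B have the same degree, yet removing A (the highest-betweenness node) leaves no large piece, while removing B only detaches two leaves.
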
  16. Gephi Introduction •  Platform for visualizing and analyzing networks •  https://gephi.org/ •  Cross-platform •  Plugin model
  17. Facebook Dataset •  Download your data (gml) •  http://snacourse.com/getnet/ •  Import into Gephi •  File -> Open -> Select downloaded .gml file •  Choose “undirected” for “Graph Type”
  18. Layout Layout -> Fruchterman Reingold
  19. Partitioning Communities 1.  Statistics -> Modularity -> Run (use defaults) 2.  Partition -> Nodes (refresh) -> Modularity class -> Apply
  20. Degree Distribution 1.  Statistics -> Average Degree -> Run 2.  Partition -> Nodes (refresh) -> Modularity class -> Apply •  Lots of nodes with few connections •  Only a few with a large number of connections •  Power law distribution?
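
Outside Gephi, the same distribution can be tabulated directly from an edge list. This is a minimal sketch on a toy edge list (an assumption; a real Facebook export would have far more edges), counting how many nodes have each degree.

```python
from collections import Counter

# Toy undirected edge list used only for illustration.
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def degree_distribution(edges):
    """Map each degree value to the number of nodes that have it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Nodes with no edges never appear in an edge list, so degree 0 is not counted.
    return Counter(degree.values())


distribution = degree_distribution(EDGES)
average_degree = 2 * len(EDGES) / sum(distribution.values())
for k in sorted(distribution):
    print("degree", k, ":", distribution[k], "nodes")
print("average degree:", average_degree)
```

Plotting these counts on log-log axes is a common first check for the heavy tail that a power law would produce.
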
  21. Node Ranking by Degree 1.  Ranking -> Nodes (refresh) -> Degree -> Apply (try tweaking min/max size and Spline for desired emphasis)
  22. Filtering Isolated Nodes (“noise”) 1.  Statistics -> Connected Components -> Run 2.  Filters -> Attributes -> Partition Count -> Component ID 3.  Drag “Component ID” down into “Queries” section 4.  Click on “Partition Count”, slide the settings bar, and click “Filter” – adjust to remove isolated nodes •  This can be an important step when dealing with very large data sets. Depending on the degree distribution, the filter can be set quite high.
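
The effect of this filter can be sketched in code: find the connected components and keep only nodes in components of at least a chosen size. The function name, the `min_size` parameter, and the sample data are all hypothetical; this illustrates the idea rather than how Gephi implements it.

```python
def filter_small_components(nodes, edges, min_size):
    """Keep only nodes (and their edges) in components with >= min_size nodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep, seen = set(), set()
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search gathers the whole component containing `start`.
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u] - comp:
                comp.add(v)
                stack.append(v)
        seen |= comp
        if len(comp) >= min_size:
            keep |= comp
    kept_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, kept_edges


# Hypothetical data: one 8-node cluster, one isolated pair, two isolated nodes.
nodes = ["A", "B", "C", "D", "X", "Y", "Z", "Q", "P", "R", "I1", "I2"]
edges = [("A", "B"), ("A", "Z"), ("A", "Q"), ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y"), ("P", "R")]

kept_nodes, kept_edges = filter_small_components(nodes, edges, min_size=3)
print(sorted(kept_nodes), len(kept_edges))
```
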
  23. Re-adjust after Filtering • Need to re-run previous steps to refresh calculated values now that filtering has been done. • Statistics -> Average degree, modularity, connected components •  How did these numbers change? • Re-partition node color by modularity class now that modularity has been recalculated • Run Fruchterman Reingold layout again to fill space left over from filtered nodes
  24. Have you saved yet!?
  25. Node Ranking by Centrality 1.  Statistics -> Network Diameter -> Run 2.  Ranking -> Betweenness Centrality -> Apply
  26. Erdös Number •  You may have noticed a key node that has both the highest degree and the highest betweenness ranking. •  Click on the “Edit” button and select that node (note the name) •  Statistics -> Erdös Number -> Select that name -> OK •  What will happen if you select a less conspicuous node?
  27. Data Lab •  Go to “Data Laboratory” •  All node information as well as calculated statistics appear here in a spreadsheet. •  Sort by “Erdös Number” (descending) •  What is the largest Erdös Number? N degrees of ________ . •  Try sorting by other values (degree, closeness, betweenness) •  In this example, the max is 7 degrees of separation
  28. Node Ranking by Eigenvector Centrality 1.  Statistics -> Eigenvector Centrality -> Run 2.  Ranking -> Eigenvector Centrality -> Apply
  29. Node Ranking by PageRank 1.  Statistics -> PageRank -> Run 2.  Ranking -> PageRank -> Apply
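
For intuition about what the PageRank statistic computes, here is a minimal power-iteration sketch. It treats each undirected edge as a link in both directions and uses a damping factor of 0.85, a commonly used value assumed here rather than taken from the slides; the edge list is the same assumed toy graph as on the ranking slides, and the sketch does not handle nodes with no neighbors.

```python
# Assumed toy edge list (inferred from the ranking slides' scores).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pagerank(adj, damping=0.85, iterations=100):
    """Power iteration; every node must have at least one neighbor."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Every node receives a small baseline, then shares from its neighbors.
        nxt = {v: (1 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            share = damping * rank[v] / len(nbrs)
            for w in nbrs:
                nxt[w] += share
        rank = nxt
    return rank


ranks = pagerank(build_adj(EDGES))
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

On this toy graph B and Z come out slightly ahead of A, even though A has the highest betweenness, which shows that different rankings can disagree about the "most important" node.
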
  30. Export to Image •  Go to “Preview” mode •  Click “Refresh” to see what you have now •  Add node labels •  “Node Labels” -> “Show Labels” •  Adjust font size to avoid label overlapping •  If Node Labels are overlapping, try expanding layout •  Back to “Overview” -> Layout -> Fruchterman Reingold •  Increase the “Area” parameter and re-run the layout •  Then go back to “Preview” mode and click “Refresh” •  May need to re-adjust Node Label text size •  Experiment with “Curved” edges
  31. labels omitted in slidedeck for privacy
  32. Before we attack the network, save!
  33. Network Resiliency •  How can we fragment the network or increase the separation between nodes? •  Which nodes, if removed/influenced, would most greatly impact the network? •  What information have we learned already that could be used?
  34. Network Resiliency •  Go to “Data Laboratory” -> sort by “PageRank” descending •  Select the top 5 rows and delete them (did you save first!!!) •  Note their names – Are these people influential in your life?
  35. Network Resiliency •  Go back to statistics and note the following: •  Average Degree, Network Diameter, Modularity, Connected Components, Average Path Length •  Also note how the network visually has changed •  Re-run the statistics above and note how the numbers changed •  Did you successfully fragment the network (did # of connected components increase)? (disrupting communications) •  How many nodes do you think you’d have to remove if you removed by lowest PageRank scores first? (robustness of network) •  What if links represented load distributed across network? How would the network load change after removing these key nodes? (cascading failure)
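
The highest-first versus lowest-first question can be tried on a toy graph. This sketch removes the k top-ranked or k bottom-ranked nodes and counts the connected components left. As a stand-in for PageRank it uses the betweenness scores from slide 9, and the edge list is an assumed reconstruction of that slide's graph; the function names are hypothetical.

```python
# Assumed edge list (inferred from the ranking slides' scores).
EDGES = [("A", "B"), ("A", "Z"), ("A", "Q"),
         ("B", "C"), ("B", "D"),
         ("Z", "X"), ("Z", "Y")]

# Betweenness scores as listed on slide 9 (used here instead of PageRank).
SCORES = {"A": 15, "B": 11, "C": 0, "D": 0, "X": 0, "Y": 0, "Z": 11, "Q": 0}


def count_components(nodes, edges):
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count


def components_after_removing(edges, scores, k, highest_first=True):
    """Remove k nodes by score order and count the remaining components."""
    sign = -1 if highest_first else 1
    order = sorted(scores, key=lambda n: (sign * scores[n], n))
    removed = set(order[:k])
    nodes = [n for n in scores if n not in removed]
    kept = [(u, v) for u, v in edges if u not in removed and v not in removed]
    return count_components(nodes, kept)


targeted = components_after_removing(EDGES, SCORES, k=2, highest_first=True)
untargeted = components_after_removing(EDGES, SCORES, k=2, highest_first=False)
print("highest first:", targeted, "components; lowest first:", untargeted)
```
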
  36. Review •  Network Analysis – Crash Course •  Degree •  Components •  Modularity •  Ranking •  Resiliency •  Gephi – Intro •  Loading data (Facebook) •  Navigation •  Statistics •  Exporting •  Filtering •  Resiliency
  37. Questions?