Link Analysis in Networks - or - Finding the Terrorists


A rapid fire introduction to network theory and how link analysis can be used by law enforcement agencies


  1. Link Analysis in Networks - or - Finding The Terrorists (Friday, 15 November 2013)
  2. About James: mathematician turned computer scientist; lives in London, UK; talks fast; works for Cisco; bad at blogging
  3. Objectives: what is link analysis; history lesson; graph theory basics; network theory concepts; link analysis basics; link analysis in the wild; getting started with link analysis
  4. What is Link Analysis (1): Which nodes are key or central to the network? Which links can be severed or strengthened to most effectively impede or enhance the operation of the network? Can the existence of undetected links or nodes be inferred from the known data? What types of structured groups of entities occur in the data set?
  5. What is Link Analysis (2): What are the relevant sub-networks within a much larger network? Are there similarities in the structure of subparts of the network that can indicate an underlying relationship (e.g., modus operandi)? What data model and level of aggregation best reveal certain types of links and sub-networks?
  6. Organised Crime vs Terrorism
  7. History
  8. Guns, Drugs & Gangs: ? - pirates, gangs, bandits and highway-robbers; 4BC - Goths and Vandals; 1800s - Yakuza, Triad, Mafia, Mafiya; 1920s+ - La Cosa Nostra, cartels, ethnocentric gangs and syndicates, IRA; 1970s - ETA; 1990s - Al-Qaeda; 2000s - Anonymous
  9. Japan's three biggest banks face yakuza links inquiry: loans-to-mobsters scandal at Mizuho prompts wider investigation into Mitsubishi UFJ and Sumitomo Mitsui groups
  10. 0th Generation
  11. 1st Generation: the generally accepted first formalisation was in 1975, with the Anacapa chart of Harper and Harris
  12. 2nd Generation: GUI software that essentially replicated the manual, hand-drawn 1st generation tools, notably i2, Netmap and Crimeflow. Because computation was automated, information could be updated in real time. Still often requires a domain expert
  13. 2nd Generation
  14. 3rd Generation: does not require domain experts to use; aggregates sources (most data is digitised now); rich metadata models; improved computational power and algorithms; billions of nodes and relationships
  15. Deduction vs. Inference
  16. Graph Theory: the basics
  17. Defn 1: Undirected Graph. An undirected graph G is an ordered pair G(V, E) where V is a set of objects called vertices and E is a set of unordered pairs of vertices from V called edges. If E does not contain an edge e(v1, v2) such that v1 = v2 then G is a simple graph
  18. Example: V = { london, paris, amsterdam, madrid }, E = { {london, paris}, {paris, amsterdam}, {paris, madrid} }
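The example graph can be written down almost literally from Defn 1. The following is a minimal Python sketch, assuming edges are modelled as frozensets; the helper names `is_simple` and `degree` are illustrative and do not come from the slides.

```python
# Vertices and edges of the example graph, following Defn 1:
# each edge is an unordered pair, modelled here as a frozenset.
V = {"london", "paris", "amsterdam", "madrid"}
E = {
    frozenset({"london", "paris"}),
    frozenset({"paris", "amsterdam"}),
    frozenset({"paris", "madrid"}),
}

def is_simple(vertices, edges):
    """True if every edge joins two distinct vertices of the graph (no self-loops)."""
    return all(len(e) == 2 and e <= vertices for e in edges)

def degree(v, edges):
    """Number of edges that touch vertex v."""
    return sum(1 for e in edges if v in e)
```

Here `degree("paris", E)` counts the three edges that meet at paris, which is why paris looks like the hub of this small graph.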
  19. Defn 3: Labels. A label is some value, e.g. an integer, colour or enumeration. An edge-labelled graph is one where some or all of the edges have labels. A vertex-labelled graph is one where some or all of the vertices have labels. A labelled graph may be edge-labelled, vertex-labelled, or both
  20. Defn 2: Directed Graph. A directed graph G is an ordered pair G(V, E) where V is a set of objects called vertices and E is a set of ordered pairs of elements of V called edges. For a vertex v the in-degree is the number of edges in E that end at v. The out-degree of v is the number of edges that start at v
  21. Example. Credit to scificat @ deviantart and Sheldon from The Big Bang Theory
  22. Example: V = { rock, scissors, paper, lizard, spock }, E = { (rock, scissors), (rock, lizard), (scissors, paper), (scissors, lizard), (paper, rock), (paper, spock), (lizard, paper), (lizard, spock), (spock, rock), (spock, scissors) }
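In-degree and out-degree from Defn 2 can be counted directly from this edge list. The sketch below is one possible way to do it in Python; reading each pair as (start, end) is an assumption about the slide's ordering, and the function name `degrees` is illustrative.

```python
from collections import Counter

# Directed edges as listed on the slide, read here as (start, end).
EDGES = [
    ("rock", "scissors"), ("rock", "lizard"),
    ("scissors", "paper"), ("scissors", "lizard"),
    ("paper", "rock"), ("paper", "spock"),
    ("lizard", "paper"), ("lizard", "spock"),
    ("spock", "rock"), ("spock", "scissors"),
]

def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of (start, end) edges."""
    out_degree = Counter(start for start, _ in edges)
    in_degree = Counter(end for _, end in edges)
    return in_degree, out_degree

in_deg, out_deg = degrees(EDGES)
```

Every vertex in this graph ends up with in-degree 2 and out-degree 2, which is what makes the game balanced.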
  23. Defn 3: Multigraph. A multigraph G is an ordered pair G(V, E) where V is a set of objects called vertices and E is a multiset of 2-element subsets of V called edges. If the elements of E are ordered pairs then G is a directed multigraph
  25. Defn 4: Subgraph. Given a graph G(Vg, Eg), a graph H(Vh, Eh) is a subgraph of G iff Vh ⊆ Vg and Eh ⊆ Eg. If Vh = Vg then H is a spanning subgraph of G
  26. Defn 4: Walks. Given a graph G(V, E), a walk W is a sequence of edges from E such that for any adjacent elements wi = (vr, vs) and wi+1 = (vt, vw), vs = vt. If a walk begins and ends on the same vertex it is closed, otherwise it is open
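The walk condition (each edge must start where the previous one ended) is easy to check mechanically. A minimal sketch, assuming a directed graph stored as a set of (start, end) pairs; the graph `E` and the function names are hypothetical and used only for illustration.

```python
def is_walk(edge_sequence, edges):
    """True if every step is an edge of the graph and consecutive edges join up."""
    if any(e not in edges for e in edge_sequence):
        return False
    return all(a[1] == b[0] for a, b in zip(edge_sequence, edge_sequence[1:]))

def is_closed(edge_sequence):
    """True if a non-empty walk ends on the vertex it started from."""
    return bool(edge_sequence) and edge_sequence[0][0] == edge_sequence[-1][1]

# Hypothetical directed graph used only for illustration.
E = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
```

For example, a-b, b-c, c-a is a closed walk in this graph, while a-b followed by c-d is not a walk at all because the two edges do not join up.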
  27. Defn 4: Cycle. A closed walk is called a cycle. A cycle must have length greater than 0. Defn 4: Cyclic & Acyclic. A graph G is said to be acyclic iff it has no subgraph which is a cycle graph
  28. Defn 4: Complete Graph. A graph G(V, E) with |V| = n is a complete graph Kn if for every vertex vi there exists an edge (vi, vk) in E for every k = 1..n with i ≠ k. Defn 4: Cliques. Given a graph G(Vg, Eg) and a subgraph H(Vh, Eh) with |Vh| = k, if H is a complete graph then H is a clique of order k, or a k-clique
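Checking whether a set of vertices forms a clique amounts to checking every pair. The sketch below assumes an undirected graph with frozenset edges; the graph `E` is a made-up example, not one from the slides.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every pair of distinct vertices in `vertices` is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

# Hypothetical undirected graph: a triangle a-b-c with a pendant vertex d.
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
```

In this graph {a, b, c} is a 3-clique, while adding d breaks completeness because d is only joined to c.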
  29. Examples
  30. Defn 5: Strongly Connected. A graph G is strongly connected iff for every ordered pair of vertices (vi, vj) in G there exists a path which starts at vi and ends at vj. Given a graph G and a subgraph H, if H is maximally strongly connected we call H a strongly connected component of G
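Strong connectivity can be tested straight from the definition by checking that every vertex reaches every other. This is a deliberately naive sketch (quadratic or worse), assuming a directed graph given as a set of (start, end) pairs; the function names are illustrative, and faster algorithms exist for large graphs.

```python
def reachable(start, edges):
    """Set of vertices reachable from `start` by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_strongly_connected(vertices, edges):
    """True if every vertex can reach every other vertex."""
    return all(reachable(v, edges) >= vertices for v in vertices)
```

A directed triangle a-b-c-a passes this test, whereas the chain a-b-c does not because nothing leads back to a.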
  31. Network Theory: basic concepts
  32. Communities. A network is said to have community structure if the nodes can be grouped into (potentially overlapping) subgraphs such that each is densely connected. Methods for finding communities: minimum-cut method; hierarchical clustering; Girvan-Newman algorithm; modularity maximisation; clique analysis
  33. Small Worlds. A small-world network is a graph G(V, E) where the average shortest path length L between any two vertices satisfies L ∝ log |V|. Small worlds are typically composed of cliques and near-cliques
  34. Random Graphs. Erdős and Rényi studied properties of random graphs in 1959. A random graph is a graph G(V, E) where each edge (vi, vj) exists with probability p, so the average degree k is approximately p * |V|
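The relationship between p and the average degree can be checked empirically. A minimal sketch, assuming an undirected graph on vertices 0..n-1 and a fixed seed for repeatability; the values n = 500 and p = 0.02 are arbitrary choices for illustration.

```python
import random
from itertools import combinations

def random_graph(n, p, seed=0):
    """Return the edge set of a random graph on vertices 0..n-1 with edge probability p."""
    rng = random.Random(seed)
    return {pair for pair in combinations(range(n), 2) if rng.random() < p}

def average_degree(n, edges):
    """Each undirected edge contributes to the degree of two vertices."""
    return 2 * len(edges) / n

n, p = 500, 0.02
edges = random_graph(n, p)
k = average_degree(n, edges)  # expected to be close to p * n
```

With these values p * n = 10, so `k` should land near 10, with some variation from run to run if the seed is changed.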
  37. If k < 1: small isolated clusters, small diameters, short average path lengths. If k = 1: one dominant cluster appears, diameter peaks, high average path lengths. If k > 1: approaches a single strongly connected component, diameter decreases, average path lengths decrease
  38. If the relationships between people in the real world can be modelled by a random graph then, because the average person knows more than one other person (k >> 1), the majority of people are connected by short paths
  40. Alpha Model. Watts (1998) proposed the α-model of networks. The α-model corrects the following in the random model: relationships generally aren’t random; relationships are often “tit for tat”; relationships usually form clusters
  42. Beta Model. The α-model is a significantly better model of real-world networks but it too has limitations. The primary limitation is that the chance of distant or random connections is unrealistically low. Watts and Strogatz (1999) proposed the β-model to correct this. For a range of values of β these networks exhibit “small world” properties
  43. Scale-Free Networks. Discovered in 1965, but they attracted little interest until 1999, when it was realised how accurately they model many real-world networks. Consider a random graph with the following degree distribution depending on two values α and β: suppose there are y vertices of degree x, where x and y satisfy log y = α - (β log x)
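The relation on this slide is a power law written in logarithmic form. A short derivation, assuming natural logarithms (the slide does not state the base):

```latex
\log y = \alpha - \beta \log x
\;\Longrightarrow\;
y = e^{\alpha - \beta \log x} = e^{\alpha}\, x^{-\beta}
```

So the number of vertices of degree x falls off as x^(-β), which appears as a straight line of slope -β on a log-log plot.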
  44. Power Law Distribution
  45. Random vs. Scale-Free (figures: random graph, scale-free graph)
  46. Scale-Free Properties. Scale-free graphs are small worlds. Vertices with higher degree than the average are very common; such vertices are called hubs. Primary hubs are supported by secondary hubs, tertiary hubs, etc. Thus scale-free networks are fault-tolerant. Vertices tend to form communities, with hubs providing inter-community connections
  47. Link Analysis: an introduction
  48. Link analysis and network theory provide techniques for analysing structure in a system of interacting agents, represented as a network. The best-known examples are web search engines: HITS (Hypertext Induced Topic Search); PageRank (Google); TrustRank (Yahoo!)
  49. Matrices: row vector (1 x n); column vector (n x 1); square matrix (2 x 2); rectangular matrix (2 x 3)
  50. Matrix Operations: addition; scalar multiplication; transpose
  51. Matrix Multiplication. Given an n*m matrix P and an m*o matrix Q, PQ is the n*o matrix where each element is given by $(PQ)_{ij} = \sum_{k=1}^{m} P_{ik} Q_{kj}$
  52. Noncommutative: $PQ \neq QP$ in general. Associative: $(PQ)R = P(QR)$. Distributive over matrix addition: $P(Q + R) = PQ + PR$. Scalar multiplication is associative over matrix multiplication: $c(PQ) = (cP)Q = P(cQ)$. Transpose: $(PQ)^T = Q^T P^T$
  53. Eigenvectors. Given a square n*n matrix A and a non-zero n-vector v, v is a (right) eigenvector of A iff $Av = \lambda v$ for some scalar λ; we call λ an eigenvalue. Eigenvalues and eigenvectors can be real or complex
  54. Example
  55. Incidence Matrix (rows and columns are both vertices; a 1 marks an edge from the column's vertex to the row's vertex, per the edge list on slide 22):

                 Rock  Scissors  Paper  Lizard  Spock
       Rock         0         0      1       0      1
       Scissors     1         0      0       0      1
       Paper        0         1      0       1      0
       Lizard       1         1      0       0      0
       Spock        0         0      1       1      0
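The matrix on this slide can be rebuilt from the edge list on slide 22. A minimal sketch, assuming the convention that entry (i, j) is 1 when there is an edge from vertex j to vertex i, which is what the slide's numbers imply; the name `in_edge_matrix` is illustrative.

```python
ORDER = ["rock", "scissors", "paper", "lizard", "spock"]
EDGES = [
    ("rock", "scissors"), ("rock", "lizard"),
    ("scissors", "paper"), ("scissors", "lizard"),
    ("paper", "rock"), ("paper", "spock"),
    ("lizard", "paper"), ("lizard", "spock"),
    ("spock", "rock"), ("spock", "scissors"),
]

def in_edge_matrix(order, edges):
    """Build M where M[i][j] = 1 if there is an edge from order[j] to order[i]."""
    index = {v: i for i, v in enumerate(order)}
    m = [[0] * len(order) for _ in order]
    for start, end in edges:
        m[index[end]][index[start]] = 1
    return m

M = in_edge_matrix(ORDER, EDGES)
```

Each row of `M` lists the vertices that point at that row's vertex, so the Rock row has 1s under Paper and Spock.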
  56. Other Representations: adjacency matrix; incidence list; adjacency list; edge lists; topological distance matrix
  57. Centrality. Centrality is a measure of how luminous a given vertex in a graph is. In an undirected graph centrality measures consider all edges. In a directed graph centrality measures consider only out-edges
  58. For a graph G(V, E), and P the set of all paths in G, we define: degree centrality; closeness centrality; betweenness centrality
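The formulas from this slide did not survive in the transcript, so the sketch below uses common normalised definitions as an assumption: degree centrality as degree / (n - 1), and closeness centrality as (n - 1) divided by the sum of shortest path lengths. It uses the undirected cities graph from slide 18; betweenness is left out to keep the example short.

```python
from collections import deque

# Undirected example graph from slide 18, stored as an adjacency list.
ADJ = {
    "london": {"paris"},
    "paris": {"london", "amsterdam", "madrid"},
    "amsterdam": {"paris"},
    "madrid": {"paris"},
}

def degree_centrality(adj, v):
    """Fraction of the other vertices that v is directly connected to."""
    return len(adj[v]) / (len(adj) - 1)

def distances_from(adj, source):
    """Shortest path length (in edges) from source to each reachable vertex, by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(adj, v):
    """(n - 1) divided by the sum of shortest path lengths from v (connected graph assumed)."""
    total = sum(distances_from(adj, v).values())
    return (len(adj) - 1) / total
```

Under these assumed definitions paris scores 1.0 on both measures, while london scores 1/3 for degree and 0.6 for closeness.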
  59. Prestige. Prestige is a measure of how visible a given vertex in a graph is. In undirected graphs we consider all edges, but in directed graphs prestige considers in-edges only. Similar to centrality metrics, we have degree prestige, proximity prestige and rank prestige
  60. PageRank: “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites” [Facts about Google and Competition]
  61. Simple PageRank. Given that the web is a graph G(V, E) where |V| = n, i.e. n pages, the PageRank of page vi is $R(v_i) = \sum_{v_j \in B(v_i)} R(v_j) / L(v_j)$, where $B(v_i)$ is the set of pages linking to vi and $L(v_j)$ is the number of links out of vj, and the initial rank of page vi is $R_0(v_i) = 1/n$
  62. Given the adjacency matrix M of a graph G(V, E) we construct the hyperlink matrix H such that $H_{ij} = M_{ij} / \sum_{k} M_{kj}$. Note that H is normalised, i.e. each column sums to 1, and all entries are non-negative. H is said to be a stochastic matrix
  63. The PageRank vector R of a graph G(V, E) with hyperlink matrix H is given by $R = HR$. That is, PageRank is the primary eigenvector of H. We can iteratively calculate R using the power method
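The power method simply applies H to a starting vector over and over until it stops changing. A minimal sketch in plain Python, assuming a column-stochastic H, a uniform starting vector and no dangling pages; the three-page web is a made-up example and the function names are illustrative.

```python
def hyperlink_matrix(order, edges):
    """Column-stochastic H: H[i][j] = 1 / outdegree(j) if j links to i, else 0."""
    index = {v: i for i, v in enumerate(order)}
    n = len(order)
    out_degree = [0] * n
    for start, _ in edges:
        out_degree[index[start]] += 1
    h = [[0.0] * n for _ in range(n)]
    for start, end in edges:
        h[index[end]][index[start]] = 1.0 / out_degree[index[start]]
    return h

def power_method(h, iterations=100):
    """Repeatedly apply R <- H R, starting from the uniform vector 1/n."""
    n = len(h)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(h[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
PAGES = ["a", "b", "c"]
LINKS = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
R = power_method(hyperlink_matrix(PAGES, LINKS))
```

For this small web the iteration settles at ranks of 0.4, 0.2 and 0.4 for a, b and c, which can be confirmed by hand from R = HR.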
  64. Example
  68. Simple PageRank Issues: dangling pages; orphan pages; cycles; rank sinks; sensitivity to initial PageRank vector
  69. Real PageRank: uses sparse matrix representations; up to 3 billion rows and columns; if the probability of teleportation is > 0.15, PageRank converges in fewer than 100 iterations; may use alternatives to the “random surfer” model
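Teleportation is one way to deal with the dangling pages and rank sinks listed on the previous slide. The sketch below is an illustration under stated assumptions, not Google's implementation: the teleport probability defaults to 0.15, dangling pages are assumed to spread their rank uniformly, and the four-page web is hypothetical. Storing only each page's out-links is a simple stand-in for a sparse matrix.

```python
def pagerank(order, edges, teleport=0.15, iterations=100):
    """Power iteration with teleportation; dangling pages spread rank uniformly."""
    n = len(order)
    out_links = {v: [] for v in order}
    for start, end in edges:
        out_links[start].append(end)
    rank = {v: 1.0 / n for v in order}
    for _ in range(iterations):
        # Rank currently held by pages with no out-links.
        dangling = sum(rank[v] for v in order if not out_links[v])
        new_rank = {v: teleport / n + (1 - teleport) * dangling / n for v in order}
        for v in order:
            for target in out_links[v]:
                new_rank[target] += (1 - teleport) * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Hypothetical web with a dangling page "d" (no out-links).
PAGES = ["a", "b", "c", "d"]
LINKS = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
ranks = pagerank(PAGES, LINKS)
```

The ranks still sum to 1 after every iteration, and the dangling page d no longer drains rank out of the system.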
  70. Link Analysis in the Wild
  71. How The NSA Works. The methods used by the NSA Prism project include: Blah blah blah blah blah Blahblahblah and by the name of Blah blah blah blah blah blah balh and trained black-ops dolphins
  72. Knowledge Mash-Ups. Multiple data sources: full text search; social graphs; telephone, email & browsing history. Representations may not be appropriate for analysis. Data may need to be transformed and managed using non-relational data structures. Important to remember that analysts who are not mathematicians prefer to work visually
  73. TerroristRank. TerroristRank works by counting the number and quality of links to a person to determine a rough estimate of how important the person is. The underlying assumption is that more important terrorists are likely to receive more links from other terrorists
  74. How to Find a Terrorist. Given a graph of actors and their interactions: determine the communities; extract the subgraph of communities containing the actors of interest; calculate the “terrorist rank” of the subgraph; actors with the highest ranks are “suspects”
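The four steps on this slide can be strung together in a few lines. The sketch below is a toy illustration only: connected components stand in for a real community detection method, the ranking is a PageRank-style score with an assumed teleport probability of 0.15, and the actors, graph and function names are all hypothetical.

```python
def communities(adj):
    """Stand-in community detection: connected components of an undirected graph."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, stack = {start}, [start]
        while stack:
            for w in adj[stack.pop()]:
                if w not in group:
                    group.add(w)
                    stack.append(w)
        seen |= group
        groups.append(group)
    return groups

def rank(adj, members, teleport=0.15, iterations=100):
    """PageRank-style score restricted to `members`, with teleportation."""
    n = len(members)
    score = {v: 1.0 / n for v in members}
    for _ in range(iterations):
        new = {v: teleport / n for v in members}
        for v in members:
            neighbours = [w for w in adj[v] if w in members]
            for w in neighbours:
                new[w] += (1 - teleport) * score[v] / len(neighbours)
        score = new
    return score

def suspects(adj, actors_of_interest, top=1):
    """Steps from the slide: communities, subgraph of interest, rank, top actors."""
    members = set()
    for group in communities(adj):
        if group & actors_of_interest:
            members |= group
    score = rank(adj, members)
    return sorted(members, key=lambda v: (-score[v], v))[:top]

# Hypothetical interaction graph: two separate groups of actors.
ADJ = {
    "ann": {"bob"}, "bob": {"ann", "cat", "dan"}, "cat": {"bob"}, "dan": {"bob"},
    "eve": {"fay"}, "fay": {"eve"},
}
```

Starting from ann as the actor of interest, `suspects(ADJ, {"ann"})` pulls in ann's community and surfaces bob, the best-connected member, while eve and fay are never considered.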
  75. Limitations: graph algorithms are typically non-linear in time and/or space complexity; adding new nodes and edges can have a dramatic impact; real-world networks are often dynamic; metrics like rank must be constantly recalculated
  76. Issues: information overload; data mining may be incomplete or find false-positive relationships; intentional or subconscious human filtering of data sources; malicious data (e.g. SEO of malware by link sites); changes in allegiance (“when good goes bad”)
  77. Deduction (prosecution) vs. inference (prevention)
  78. Getting Started with Graph Algorithms
  79. Matlab: a mathematical workbench aimed at scientists and mathematicians, not developers; has a graph algorithms “plugin”, gaimc; (generally) slower than native code; has APIs for bi-directional integration with Java; good for learning the mathematics behind algorithms
  80. Cern Colt (http:/ /): high performance data structures and operations: collections (including primitive templated lists & maps); matrices; linear algebra; mathematics & statistics; random sampling & number generation. Documentation is a bit hit and miss. API not always natural to Java programmers
  81. Jung 2.0 (http:/ /): pure Java graph algorithm library; OO API; uses the Cern Colt library for matrix representations and operations; uses in-memory storage, and performance limitations are generally related to this; easy to extend; requires a good understanding of algorithms and internals to maximise performance
  82. Neo4j: transactional property graph database (ACID compliant). Property graphs are labelled directed multigraphs in which both vertices and edges can have any number of key/value properties associated with them. Numerous built-in algorithms; powerful, flexible query language allows implementation of algorithms; Community version limits total number of nodes; excellent Spring integration. Mark Needham http:/ /
  83. Alternatively: Hadoop & map-reduce; Gremlin, a Groovy-based graph DSL; scala-graph, young & operator-overloading hell; R programming language
  84. Summary
  85. Objectives Review: what is link analysis; history lesson; graph theory basics; network theory basics; link analysis basics; link analysis in the wild; getting started with link analysis
  86. Link analysis is just one tool in the box for extracting information from graphs. Much of the skill lies in: filtering the raw graph to prevent information overload; pruning the graph to allow expensive algorithms to compute an effective answer; working iteratively, starting by extracting simple data from a graph before trying, say, community analysis; understanding enough of the underlying mathematics to choose the right tool for the job
  87. Thank You