Published in:
Technology


- 1. Link Analysis in Networks - or Finding The Terrorists Friday, 15 November 2013
- 2. About James
  - Mathematician turned computer scientist
  - Lives in London, UK
  - Talks fast
  - Works for Cisco
  - Bad at blogging
- 3. Objectives
  - What is link analysis
  - History lesson
  - Graph theory basics
  - Network theory concepts
  - Link analysis basics
  - Link analysis in the wild
  - Getting started with link analysis
- 4. What is Link Analysis (1)
  - Which nodes are key or central to the network?
  - Which links can be severed or strengthened to most effectively impede or enhance the operation of the network?
  - Can the existence of undetected links or nodes be inferred from the known data?
  - What types of structured groups of entities occur in the data set?
- 5. What is Link Analysis (2)
  - What are the relevant sub-networks within a much larger network?
  - Are there similarities in the structure of subparts of the network that can indicate an underlying relationship (e.g., modus operandi)?
  - What data model and level of aggregation best reveal certain types of links and sub-networks?
- 6. Organised Crime vs Terrorism
- 7. History
- 8. Guns, Drugs & Gangs
  - ?: pirates, gangs, bandits and highway robbers
  - 4BC: Goths and Vandals
  - 1800s: Yakuza, Triad, Mafia, Mafiya
  - 1920s+: La Cosa Nostra, cartels, ethnocentric gangs and syndicates, IRA
  - 1970s: ETA
  - 1990s: Al-Qaeda
  - 2000s: Anonymous
- 9. Japan's three biggest banks face yakuza links inquiry. Loans to mobsters scandal at Mizuho prompts wider investigation into Mitsubishi UFJ and Sumitomo Mitsui groups. http://www.theguardian.com/world/2013/oct/30/japan-three-biggestbanks-yakuza-links-inquiry
- 10. 0th Generation
- 11. 1st Generation: the generally accepted first formalisation was in 1975, with the Anacapa Chart of Harper and Harris
- 12. 2nd Generation
  - GUI software that essentially replicated the manual and hand-drawn 1st generation tools, notably i2, Netmap and Crimeflow
  - Due to automated computation, information could be updated in real time
  - Still often requires a domain expert
- 13. 2nd Generation
- 14. 3rd Generation
  - Do not require domain experts for usage
  - Aggregate sources: most data is digitised now
  - Rich meta-data models
  - Improved computational power and algorithms
  - Billions of nodes and relationships
- 15. Deduction vs. Inference
- 16. Graph Theory: the basics
- 17. Defn 1: Undirected Graph. An undirected graph, G, is an ordered pair G(V, E) where
  - V is a set of objects called vertices
  - E is a set of 2-element subsets of V called edges
  - If E does not contain an edge e(v1, v2) such that v1 = v2, then G is a simple graph
- 18. Example: V = { london, paris, amsterdam, madrid }, E = { {london, paris}, {paris, amsterdam}, {paris, madrid} }
- 19. Defn 3: Labels
  - A label is some value, e.g. an integer, colour or enumeration
  - An edge-labelled graph is one where some or all of the edges have labels
  - A vertex-labelled graph is one where some or all of the vertices have labels
  - A labelled graph may be edge-labelled, vertex-labelled, or both
- 20. Defn 2: Directed Graph. A directed graph, G, is an ordered pair G(V, E) where
  - V is a set of objects called vertices
  - E is a set of ordered pairs of elements of V called edges
  - For a vertex v, the in-degree is the number of edges in E that end at v; the out-degree of v is the number of edges that start at v
- 21. Example: credit to scificat @ deviantart and Sheldon from The Big Bang Theory
- 22. Example: V = { rock, scissors, paper, lizard, spock }, E = { (rock, scissors), (rock, lizard), (scissors, paper), (scissors, lizard), (paper, rock), (paper, spock), (lizard, paper), (lizard, spock), (spock, rock), (spock, scissors) }
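
The in-degree and out-degree definitions can be checked directly on this example. The following is a minimal sketch that assumes each edge is read as an ordered pair (winner, loser); the helper names `in_degree` and `out_degree` are illustrative and do not come from the slides.

```python
# Directed graph from the example: an edge (a, b) is read as "a beats b".
V = {"rock", "scissors", "paper", "lizard", "spock"}
E = {
    ("rock", "scissors"), ("rock", "lizard"),
    ("scissors", "paper"), ("scissors", "lizard"),
    ("paper", "rock"), ("paper", "spock"),
    ("lizard", "paper"), ("lizard", "spock"),
    ("spock", "rock"), ("spock", "scissors"),
}

def out_degree(v, edges):
    """Number of edges that start at v."""
    return sum(1 for (start, _end) in edges if start == v)

def in_degree(v, edges):
    """Number of edges that end at v."""
    return sum(1 for (_start, end) in edges if end == v)

# Map each vertex to its (in-degree, out-degree) pair.
degrees = {v: (in_degree(v, E), out_degree(v, E)) for v in sorted(V)}
```

Every vertex in this particular graph beats two others and is beaten by two others, so each (in-degree, out-degree) pair is (2, 2).
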
- 23. Defn 3: Multigraph. A multigraph, G, is an ordered pair G(V, E) where
  - V is a set of objects called vertices
  - E is a multiset of 2-element subsets of V called edges
  - If the elements of E are ordered pairs, then G is a directed multigraph
- 24.
- 25. Defn 4: Subgraph. Given a graph G(Vg, Eg), a graph H(Vh, Eh) is a subgraph of G iff Vh ⊆ Vg and Eh ⊆ Eg. If Vh = Vg then H is a spanning subgraph of G
- 26. Defn 4: Walks. Given a graph G(V, E), a walk W is a sequence of edges from E such that for any adjacent elements wi = (vr, vs) and wi+1 = (vt, vw), vs = vt. If a walk begins and ends on the same vertex it is closed, otherwise it is open
- 27. Defn 4: Cycle. A closed walk is called a cycle. A cycle must have length greater than 0. Defn 4: Cyclic & Acyclic. A graph G is said to be acyclic iff it has no subgraph which is a cycle graph
- 28. Defn 4: Complete Graph. A graph G(V, E) with |V| = n is a complete graph Kn if for every vertex vi there exists an edge (vi, vk) in E for k = 1..n, i ≠ k. Defn 4: Cliques. Given a graph G(Vg, Eg) and a subgraph H(Vh, Eh) with |Vh| = k, if H is a complete graph then H is a clique of order k, or a k-clique
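
As an illustration of the clique definition, the sketch below tests whether a chosen set of vertices induces a complete subgraph of an undirected graph. It reuses the city graph from slide 18 plus one hypothetical extra edge (amsterdam, madrid), added here only so that a triangle exists; the function name `is_clique` is an assumption.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every pair of distinct vertices is joined by an undirected edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(vertices, 2))

# City graph from slide 18, plus a hypothetical amsterdam-madrid edge.
E = [("london", "paris"), ("paris", "amsterdam"), ("paris", "madrid"),
     ("amsterdam", "madrid")]

triangle = is_clique(["paris", "amsterdam", "madrid"], E)  # all three pairs are edges
not_clique = is_clique(["london", "paris", "madrid"], E)   # london-madrid is missing
```

Checking every pair costs O(k^2) for k vertices, which is fine for verifying a candidate; finding the largest clique in a graph is a much harder problem.
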
- 29. Examples
- 30. Defn 5: Strongly Connected. A graph G is strongly connected iff for every pair of vertices {vi, vj} in G there exists a path which starts at vi and ends at vj. Given a graph G and a subgraph H, if H is maximally strongly connected we call H a strongly connected component of G
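
A direct, unoptimised way to test this definition is to check that every vertex can reach every other vertex. The sketch below does that with a depth-first search per vertex; it is for illustration only (linear-time algorithms such as Tarjan's exist), and the function names are assumptions.

```python
def reachable(start, edges):
    """Set of vertices reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for (a, b) in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_strongly_connected(vertices, edges):
    """True if every vertex can reach every other vertex."""
    vertices = set(vertices)
    return all(reachable(v, edges) >= vertices for v in vertices)

# The rock-paper-scissors-lizard-spock graph from slide 22.
rpsls = [("rock", "scissors"), ("rock", "lizard"), ("scissors", "paper"),
         ("scissors", "lizard"), ("paper", "rock"), ("paper", "spock"),
         ("lizard", "paper"), ("lizard", "spock"), ("spock", "rock"),
         ("spock", "scissors")]
rpsls_vertices = {"rock", "scissors", "paper", "lizard", "spock"}

# A hypothetical one-way chain a -> b -> c, which is not strongly connected.
chain = [("a", "b"), ("b", "c")]

rpsls_ok = is_strongly_connected(rpsls_vertices, rpsls)
chain_ok = is_strongly_connected({"a", "b", "c"}, chain)
```
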
- 31. Network Theory: basic concepts
- 32. Communities. A network is said to have community structure if the nodes can be grouped into (potentially overlapping) subgraphs such that each is densely connected. Methods for finding communities:
  - Minimum-cut method
  - Hierarchical clustering
  - Girvan-Newman algorithm
  - Modularity maximisation
  - Clique analysis
- 33. Small Worlds. A small-world network is a graph G(V, E) where the average minimum path length between any two vertices is L, where L ∝ log |V|. Small worlds are typically composed of cliques and near-cliques
- 34. Random Graphs. Erdős and Rényi studied properties of random graphs in 1959. A random graph G is a graph G(V, E) where the probability that an edge (vi, vj) exists is given by p, so the average degree k is approximately p * |V|
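
The relationship between p and the average degree can be checked empirically. This is a minimal sketch of the G(n, p) construction for an undirected graph, using a fixed seed so that the run is repeatable; the function names, the seed, and the values n = 200, p = 0.05 are assumptions chosen for illustration.

```python
import random

def random_graph(n, p, seed=0):
    """Undirected G(n, p): each of the n*(n-1)/2 possible edges is kept with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def average_degree(n, edges):
    """Each undirected edge contributes to the degree of two vertices."""
    return 2 * len(edges) / n

n, p = 200, 0.05
edges = random_graph(n, p)
k = average_degree(n, edges)  # expected value is p * (n - 1), close to p * n
```
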
- 35.
- 36.
- 37. If k < 1: small isolated clusters, small diameters, short average path lengths. If k = 1: one dominant cluster appears, diameter peaks, high average path lengths. If k > 1: approaches a single strongly connected component, diameter decreases, average path lengths decrease
- 38. If the relationships between people in the real world can be modelled by a random graph, then because the average person knows more than one other person (k >> 1), the majority of people are connected by short paths
- 40. Alpha Model. Watts (1998) proposed the α-model of networks. The α-model corrects the following in the random model:
  - Relationships generally aren't random
  - Relationships are often "tit for tat"
  - Relationships usually form clusters
- 41.
- 42. Beta Model
  - The α-model is a significantly better model of real-world networks, but it too has limitations
  - The primary limitation is that the chance of distant or random connections is unrealistically low
  - Watts and Strogatz (1999) proposed the β-model to correct this
  - For a range of values of β these networks exhibit "small world" properties
- 43. Scale-Free Networks. Discovered in 1965, but attracted little interest until 1999, when it was realised how accurately they model many real-world networks. Consider a random graph with the following degree distribution depending on two values α and β: suppose there are y vertices of degree x, where x and y satisfy log y = α - (β log x)
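
Exponentiating both sides of the relation above shows why it is called a power law. The step below assumes natural logarithms; with another base b the constant becomes b^α instead of e^α.

```latex
\log y = \alpha - \beta \log x
\quad\Longrightarrow\quad
y = e^{\alpha - \beta \log x} = e^{\alpha}\, x^{-\beta}
```

So the number of vertices y falls off as a power of the degree x, and on log-log axes the degree distribution is a straight line with slope -β and intercept α.
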
- 44. Power Law Distribution
- 45. Random vs. Scale-Free (figure panels: random graph, scale-free graph)
- 46. Scale-Free Properties
  - Scale-free graphs are small worlds
  - Vertices with much higher degree than the average are common; such vertices are called hubs
  - Primary hubs are supported by secondary hubs, tertiary hubs, etc.
  - Thus scale-free networks are fault-tolerant
  - Vertices tend to form communities, with hubs providing inter-community connection
- 47. Link Analysis: an introduction
- 48. Link analysis and network theory provide techniques for analysing structure in a system of interacting agents, represented as a network. The best-known examples are web search engines:
  - HITS (Hypertext Induced Topic Search): ask.com
  - PageRank: Google
  - TrustRank: Yahoo!
- 49. Matrices: row vector (1 x n), column vector (n x 1), square matrix (2 x 2), rectangular matrix (2 x 3)
- 50. Matrix Operations: addition, scalar multiplication, transpose
- 51. Matrix Multiplication. Given an n*m matrix P and an m*o matrix Q, PQ is the n*o matrix where each element is given by $(PQ)_{ij} = \sum_{k=1}^{m} P_{ik} Q_{kj}$
- 52. Properties of matrix multiplication
  - Noncommutative: in general $PQ \neq QP$
  - Associative: $(PQ)S = P(QS)$
  - Distributive over matrix addition: $P(Q + S) = PQ + PS$
  - Scalar multiplication is associative over matrix multiplication: $c(PQ) = (cP)Q = P(cQ)$
  - Transpose: $(PQ)^T = Q^T P^T$
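
The multiplication rule and the listed properties can be verified on small integer matrices. The sketch below uses plain nested lists so that it has no dependencies; the helper names and the matrices A, B and C are illustrative assumptions, not values from the slides.

```python
def matmul(P, Q):
    """Product of an n*m matrix P and an m*o matrix Q, given as nested lists."""
    m = len(Q)
    assert all(len(row) == m for row in P), "inner dimensions must match"
    return [[sum(P[i][k] * Q[k][j] for k in range(m)) for j in range(len(Q[0]))]
            for i in range(len(P))]

def add(P, Q):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(P, Q)]

def transpose(P):
    """Swap rows and columns."""
    return [list(col) for col in zip(*P)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

AB = matmul(A, B)  # [[2, 1], [4, 3]]
BA = matmul(B, A)  # [[3, 4], [1, 2]], so AB != BA
```

Multiplying by B on the right swaps the columns of A, while multiplying on the left swaps its rows, which is a compact way to see why the order of the factors matters.
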
- 53. Eigenvectors. Given a square n*n matrix A and a non-zero n-vector v, v is a (right) eigenvector of A iff $Av = \lambda v$ for some scalar λ; we call λ an eigenvalue. Eigenvalues and eigenvectors can be real or complex
- 54. Example
- 55. Incidence Matrix

  |          | Rock | Scissors | Paper | Lizard | Spock |
  |----------|------|----------|-------|--------|-------|
  | Rock     | 0    | 0        | 1     | 0      | 1     |
  | Scissors | 1    | 0        | 0     | 0      | 1     |
  | Paper    | 0    | 1        | 0     | 1      | 0     |
  | Lizard   | 1    | 1        | 0     | 0      | 0     |
  | Spock    | 0    | 0        | 1     | 1      | 0     |
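
The matrix on this slide can be rebuilt from the edge set of slide 22. Comparing the two suggests that the entry in row i, column j is 1 when there is an edge from vertex j to vertex i (the column beats the row); that orientation is inferred from the numbers, not stated on the slide. A minimal sketch under that assumption:

```python
names = ["rock", "scissors", "paper", "lizard", "spock"]
# Edges from slide 22, read as (winner, loser).
E = [("rock", "scissors"), ("rock", "lizard"), ("scissors", "paper"),
     ("scissors", "lizard"), ("paper", "rock"), ("paper", "spock"),
     ("lizard", "paper"), ("lizard", "spock"), ("spock", "rock"),
     ("spock", "scissors")]

index = {name: i for i, name in enumerate(names)}

# M[i][j] = 1 when there is an edge from vertex j to vertex i,
# matching the row/column orientation of the table on the slide.
M = [[0] * len(names) for _ in names]
for (start, end) in E:
    M[index[end]][index[start]] = 1
```

With this orientation each column sums to the out-degree of its vertex, which is the quantity needed when a matrix is later normalised column by column.
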
- 56. Other Representations
  - Adjacency matrix
  - Incidence list
  - Adjacency list
  - Edge lists
  - Topological distance matrix
- 57. Centrality. Centrality is a measure of how luminous a given vertex in a graph is. In an undirected graph, centrality measures consider all edges. In a directed graph, centrality measures consider only out-edges
- 58. For a graph G(V, E), and P the set of all paths in G, we define:
  - Degree-centrality
  - Closeness-centrality
  - Betweenness-centrality
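
The formulas for these measures did not survive in the transcript, so the sketch below uses common textbook normalisations, which may differ from the ones on the slide: degree centrality as the fraction of other vertices that are neighbours, and closeness centrality as (n - 1) divided by the sum of shortest-path distances. It covers undirected, connected graphs only and leaves out betweenness, which needs shortest-path counting.

```python
from collections import deque

def neighbours(v, edges):
    """Vertices joined to v by an undirected edge."""
    return {b for a, b in edges if a == v} | {a for a, b in edges if b == v}

def degree_centrality(v, vertices, edges):
    """Fraction of the other vertices that v is directly connected to."""
    return len(neighbours(v, edges)) / (len(vertices) - 1)

def closeness_centrality(v, vertices, edges):
    """(n - 1) divided by the sum of shortest-path distances from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search gives shortest hop counts
        u = queue.popleft()
        for w in neighbours(u, edges):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (len(vertices) - 1) / sum(dist.values())

# City graph from slide 18.
V = ["london", "paris", "amsterdam", "madrid"]
E = [("london", "paris"), ("paris", "amsterdam"), ("paris", "madrid")]

paris_degree = degree_centrality("paris", V, E)
paris_closeness = closeness_centrality("paris", V, E)
london_closeness = closeness_centrality("london", V, E)
```

Paris is adjacent to every other city, so it scores 1.0 on both measures, while London is one hop from Paris and two hops from the rest.
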
- 59. Prestige. Prestige is a measure of how visible a given vertex in a graph is. In undirected graphs we consider all edges, but in directed graphs prestige considers in-edges only. Similar to centrality metrics, we have degree prestige, proximity prestige and rank prestige
- 60. PageRank. "PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites" [Facts about Google and Competition]
- 61. Simple PageRank. Given the web is a graph G(V, E) where |V| = n, i.e. n pages, the PageRank of page vi is $PR(v_i) = \sum_{v_j \in B(v_i)} PR(v_j) / L(v_j)$, where $B(v_i)$ is the set of pages that link to vi and $L(v_j)$ is the number of out-links of vj, and the initial rank of page vi is $PR_0(v_i) = 1/n$
- 62. Given the adjacency matrix M of a graph G(V, E), we construct the hyperlink matrix H such that $H_{ij} = M_{ij} / \sum_{k} M_{kj}$. Note that H is normalised, i.e. each column sums to 1, and all entries are non-negative. H is said to be a stochastic matrix
- 63. The PageRank vector R of a graph G(V, E) with hyperlink matrix H is given by $R = HR$. That is, PageRank is the primary eigenvector of H. We can iteratively calculate R using the power method
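
The power method mentioned here can be sketched in a few lines: build a column-normalised hyperlink matrix, start from the uniform vector, and apply the matrix repeatedly. The three-page web below is hypothetical, the function names are assumptions, and the code assumes that every page has at least one out-link and that no link is repeated.

```python
def hyperlink_matrix(n, links):
    """Column-stochastic H: H[i][j] = 1/outdegree(j) if page j links to page i."""
    out = [0] * n
    for (src, _dst) in links:
        out[src] += 1
    H = [[0.0] * n for _ in range(n)]
    for (src, dst) in links:
        H[dst][src] = 1.0 / out[src]
    return H

def power_method(H, iterations=100):
    """Repeatedly apply R <- H R, starting from the uniform vector 1/n."""
    n = len(H)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(H[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
H = hyperlink_matrix(3, links)
R = power_method(H)
```

For this graph the iteration settles at approximately (0.4, 0.2, 0.4): page 1 receives only half of page 0's rank, while pages 0 and 2 pass most of their rank to each other.
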
- 64. Example
- 65.
- 66.
- 67.
- 68. Simple PageRank Issues
  - Dangling pages
  - Orphan pages
  - Cycles
  - Rank sinks
  - Sensitivity to initial PageRank vector
- 69. Real PageRank
  - Use sparse matrix representations: up to 3 billion rows and columns
  - If the probability of teleportation is > 0.15, PageRank converges in fewer than 100 iterations
  - May use an alternative to the "random surfer" model
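
A common way to address dangling pages and rank sinks is to add teleportation to the iteration. The sketch below is one widely used formulation, not necessarily the one behind the slide: with probability `damping` the surfer follows a link, otherwise jumps to a uniformly random page, and the rank held by dangling pages is spread evenly. A damping value of 0.85 corresponds to a teleportation probability of 0.15, and the edge list stands in for a sparse matrix.

```python
def pagerank(n, links, damping=0.85, iterations=100):
    """Power iteration with teleportation; dangling pages spread their rank evenly."""
    out = [0] * n
    for (src, _dst) in links:
        out[src] += 1
    R = [1.0 / n] * n
    for _ in range(iterations):
        # Rank currently held by pages with no out-links.
        dangling = sum(R[j] for j in range(n) if out[j] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n for _ in range(n)]
        # Only existing links are visited, so the work is proportional to |E|.
        for (src, dst) in links:
            new[dst] += damping * R[src] / out[src]
        R = new
    return R

# Hypothetical four-page web in which page 3 is dangling (no out-links).
links = [(0, 1), (1, 2), (2, 0), (0, 3)]
R = pagerank(4, links)
```

Each pass keeps the ranks summing to 1, so the result can be read as a probability distribution over pages.
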
- 70. Link Analysis in the Wild
- 71. How The NSA Works. The methods used by the NSA Prism project include: Blah blah blah blah blah Blahblahblah and by the name of Blah blah blah blah blah blah blah and trained black-ops dolphins
- 72. Knowledge Mash-Ups
  - Multiple data sources: full text search, social graphs, telephone, email & browsing history
  - Representations may not be appropriate for analysis
  - Data may need to be transformed and managed using non-relational data structures
  - Important to remember that non-mathematician analysts prefer to work with visual
- 73. TerroristRank. TerroristRank works by counting the number and quality of links to a person to determine a rough estimate of how important the person is. The underlying assumption is that more important terrorists are likely to receive more links from other terrorists
- 74. How to Find a Terrorist. Given a graph of actors and their interactions:
  - Determine the communities
  - Extract the subgraph of communities containing the actors of interest
  - Calculate the "terrorist rank" of the subgraph
  - Actors with the highest ranks are "suspects"
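
The four steps above can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: the actors and calls are invented, weakly connected components stand in for a real community-detection method, and the rank is a PageRank-style score with a damping factor of 0.85.

```python
def communities(vertices, edges):
    """Crude stand-in for community detection: weakly connected components."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        component, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adj[u] - component:
                component.add(w)
                stack.append(w)
        seen |= component
        result.append(component)
    return result

def rank(vertices, edges, damping=0.85, iterations=100):
    """PageRank-style score over directed interaction edges."""
    vs = sorted(vertices)
    n = len(vs)
    out = {v: sum(1 for a, _ in edges if a == v) for v in vs}
    R = {v: 1.0 / n for v in vs}
    for _ in range(iterations):
        dangling = sum(R[v] for v in vs if out[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in vs}
        for a, b in edges:
            new[b] += damping * R[a] / out[a]
        R = new
    return R

# Hypothetical interaction data: (caller, callee).
actors = ["a", "b", "c", "d", "x", "y"]
calls = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("x", "y")]
person_of_interest = "a"

# Steps 1 and 2: find the community containing the actor of interest.
community = next(c for c in communities(actors, calls) if person_of_interest in c)
sub_edges = [(s, t) for s, t in calls if s in community and t in community]

# Steps 3 and 4: rank the subgraph and sort actors by score.
scores = rank(community, sub_edges)
suspects = sorted(scores, key=scores.get, reverse=True)
```

In this toy data "c" is called by everyone else in the community and so ranks first, followed by "a", who is the only actor that "c" calls.
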
- 75. Limitations
  - Typically, graph algorithms are non-linear in time and/or space complexity
  - Adding new nodes and edges can have a dramatic impact
  - Real-world networks are often dynamic
  - Metrics like rank must be constantly recalculated
- 76. Issues
  - Information overload
  - Data mining may be incomplete or find false positive relationships
  - Intentional or subconscious human filtering of data sources
  - Malicious data (e.g. SEO of malware by link sites)
  - Changes in allegiance ("when good goes bad")
- 77. Deduction (Prosecution) vs. Inference (Prevention)
- 78. Getting Started with Graph Algorithms
- 79. Matlab
  - A mathematical workbench aimed at scientists and mathematicians, not developers
  - Has a graph algorithms "plugin", gaimc
  - (Generally) slower than native code
  - Has APIs for bi-directional integration with Java
  - Good for learning the mathematics behind algorithms
- 80. Cern Colt: http://acs.lbl.gov/software/colt/
  - High performance data structures and operations: collections (including primitive templated lists and maps), matrices, linear algebra, mathematics and statistics, random sampling and number generation
  - Documentation a bit hit and miss
  - API not always natural to Java programmers
- 81. Jung 2.0: http://jung.sourceforge.net/
  - Pure Java graph algorithm library
  - OO API
  - Uses the Cern Colt library for matrix representations and operations
  - Uses in-memory storage; performance limitations are generally related to this
  - Easy to extend
  - Requires a good understanding of algorithms and internals to maximise performance
- 82. Neo4j
  - Transactional property graph database (ACID compliant)
  - Property graphs are labelled directed multigraphs; both vertices and edges can have any number of key/value properties associated with them
  - Numerous in-built algorithms
  - Powerful, flexible query language allows implementation of algorithms
  - Community version limits total number of nodes
  - Excellent Spring integration
  - Mark Needham: http://www.markhneedham.com/blog/
- 83. Alternatively
  - Hadoop and map-reduce
  - gremlin: a Groovy-based graph DSL
  - scala-graph: young, and operator-overloading hell
  - R programming language
- 84. Summary
- 85. Objectives Review
  - What is link analysis
  - History lesson
  - Graph theory basics
  - Network theory basics
  - Link analysis basics
  - Link analysis in the wild
  - Getting started with link analysis
- 86. Link analysis is just one tool in the box for extracting information from graphs. Much of the skill lies in:
  - Filtering the raw graph to prevent information overload
  - Pruning the graph to allow expensive algorithms to compute an effective answer
  - Working iteratively: start by extracting simple data from a graph before trying, say, community analysis
  - Understanding enough of the underlying mathematics to choose the right tool for the job
- 87. Thank You
