Building graphs to
discover information
Dr. David Martinez Rego
Big Data Spain 2015
Data science?
• Every once in a while you hear the same question in the office, in discussions,…
• But… what is a data scientist?
• Of course the response is usually vague, but here is my definition (from the ML point of view):
• Do whatever you can to transform raw data into information that carries some business value.
Acme project: Day 1
• After the first handshake, reality!
• The team is usually handed data which has not been prepared for any learnable task
• The aim (the business value) is not clear, or not present at all
• Many books talk about design strategies
• Context, need, vision, outcome (Max Shron)
Labels? Prior knowledge? Exploration? Scale?
Let's use a graph!
What can I do?
• Find structure in the information
• Connected components
• Hubs
• Infer information (more on this later…)
• Clustering
• Classification
• Anomaly detection
• Many more…
Anywhere?
• Scalable graph algorithms are behind many of the biggest recent developments
• PageRank, social networks, medical research, DevOps, …
• So, is there an extra mile?
• All these cases have something in common: they take the graph for granted
• it is already given by the problem
• it is highly sparse
• it carries business value
Anywhere?
• What about the case where the graph is not explicit?
• That case carries more work, since we have to figure out
• how to encode individuals in a way that the graph carries the information we want
• how to build the graph itself, which is a challenging problem!!
Anywhere?
• Naïve algorithm (a code sketch follows below):
    for i = 1..N
        for j = 1..N
            M[i,j] = sim(d[i], d[j])
    prune(graph)
• With around 1 million entities this means about 10^12 similarity evaluations, more than the whole set of tweets posted in a year.
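For concreteness, a minimal sketch of this naïve O(N²) construction, assuming cosine similarity over feature vectors and a simple threshold prune (all names are illustrative, not from the talk):

```python
import numpy as np

def naive_similarity_graph(X, threshold=0.8):
    """Naive O(N^2) construction: score every pair, then prune weak edges.

    X: (N, d) array of feature vectors (illustrative assumption).
    Returns a list of (i, j, similarity) edges above the threshold.
    """
    # Normalise rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)

    edges = []
    N = len(Xn)
    for i in range(N):
        for j in range(i + 1, N):           # each pair scored once
            sim = float(Xn[i] @ Xn[j])
            if sim >= threshold:            # prune: keep only strong edges
                edges.append((i, j, sim))
    return edges
```

With N around one million, the inner loop runs roughly 5 × 10^11 times, which is exactly the cost the slide warns about.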
Wiser options
• So then, we need techniques that allow us to calculate the k-NN graph without having to calculate the whole adjacency matrix
• Not possible to do it exactly, but possible within an error margin
• Locality-Sensitive Hashing: for some specific metrics such as Euclidean, Hamming, L1, and some edit distances through embeddings.
• Semantic Hashing: when the notion of metric/similarity is not clear; can work very well, although there are no theoretical guarantees.
• Main idea of both → create a hash function such that for similar items we create collisions with high probability, and for dissimilar items collisions are unlikely.
LSH
[Figure from Kristen Grauman & Rob Fergus, illustrating Locality-Sensitive Hashing: a series of b randomized LSH functions h_{r1…rb} maps the database items into a hash table, the same functions are applied to a query Q, and only the colliding instances (<< n) are searched. Caption: "Locality Sensitive Hashing (LSH) uses hash keys constructed so as to guarantee collision is more likely for more similar examples [33, 23]. Once all database items have been hashed into the table(s), the same randomized functions are applied to novel queries. One exhaustively searches only those examples with which the query collides."]
LSH
• LSH relies on the existence of an LSH family of functions for a given metric (a sketch of one such family follows below).
• A family H is (R, cR, P1, P2)-sensitive if, for any two points p and q:
• if |p − q| < R, then P[h(p) == h(q)] > P1
• if |p − q| > cR, then P[h(p) == h(q)] < P2
where h is selected independently at random from the family H, and P1 > P2.
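As an illustration (not one of the talk's examples), a minimal sketch of a classical LSH family, random-hyperplane hashing for the angular/cosine metric; names and sizes are illustrative:

```python
import numpy as np

def make_hyperplane_hash(dim, rng):
    """One member h of the random-hyperplane LSH family for cosine similarity."""
    r = rng.standard_normal(dim)             # random hyperplane normal
    return lambda x: int(np.dot(r, x) >= 0)  # 1 bit: which side of the plane

rng = np.random.default_rng(0)
h = make_hyperplane_hash(dim=128, rng=rng)

p = rng.standard_normal(128)
q = p + 0.05 * rng.standard_normal(128)      # a nearby point
far = rng.standard_normal(128)               # an unrelated point

print(h(p) == h(q))    # collides with high probability
print(h(p) == h(far))  # collides with probability around 1/2
```

For two points at angle θ, a single hyperplane bit collides with probability 1 − θ/π, which gives exactly the P1 > P2 gap the definition requires.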
LSH
• The effect emerges from a basic probability phenomenon (see the formulas below)
• For a hash code of length m, two close points collide with probability p1^m
• On the other hand, two far-apart points collide with probability p2^m
• If p1 > p2, the gap between the two grows even with moderate code sizes
• Unfortunately, when designing the LSH family we cannot always achieve a high p1
• Remedy: build several tables with the same strategy, so that the probability of finding an approximate nearest neighbour increases (a union-bound style argument).
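In symbols (notation added here for clarity; m is the code length, L the number of tables, δ the acceptable failure probability):

```latex
\Pr[\text{close pair collides in one table}] = p_1^{m},
\qquad
\Pr[\text{far pair collides in one table}] = p_2^{m}

\Pr[\text{close pair collides in at least one of } L \text{ tables}]
  = 1 - \bigl(1 - p_1^{m}\bigr)^{L} \;\ge\; 1 - \delta
\quad\text{when}\quad
L \;\ge\; \frac{\log \delta}{\log\bigl(1 - p_1^{m}\bigr)}.
```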
LSH
[Excerpt, cleaned up, from the description of the basic LSH algorithm in Communications of the ACM, January 2008, Vol. 51, No. 1; a code sketch of this scheme follows below.]

Preprocessing:
1. Choose L functions g_j, j = 1,…,L, by setting g_j = (h_{1,j}, h_{2,j}, …, h_{k,j}), where h_{1,j},…,h_{k,j} are chosen at random from the LSH family H.
2. Construct L hash tables, where, for each j = 1,…,L, the j-th hash table contains the dataset points hashed using the function g_j. Since the total number of buckets may be large, only the non-empty buckets are retained by resorting to standard hashing of the values g_j(p); in this way the data structure uses only O(nL) memory cells, and it suffices that the buckets store pointers to the data points, not the points themselves.

Query algorithm for a query point q:
1. For each j = 1,…,L:
   i) Retrieve the points from the bucket g_j(q) in the j-th hash table.
   ii) For each retrieved point, compute its distance to q and report it if it is a correct answer (a cR-near neighbour for Strategy 1, an R-near neighbour for Strategy 2).
   iii) (optional) Stop as soon as the number of reported points exceeds L′.

How should the parameter k be chosen? Intuitively, larger values of k lead to a larger gap between the collision probabilities for close and far points, P1^k and P2^k respectively, so the hash functions are more selective. At the same time, if k is large then P1^k is small, which means L must be large enough that an R-near neighbour collides with the query at least once: choosing L ≥ log δ / log(1 − P1^k), so that (1 − P1^k)^L ≤ δ, guarantees that any R-neighbour of q is returned with probability at least 1 − δ.
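A compact sketch of that preprocessing/query scheme, using random hyperplanes as the underlying family H (an assumption made for illustration; the table keys are the concatenated k-bit codes g_j):

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Minimal L-table LSH index for cosine similarity (random hyperplanes).

    Each table uses g_j = (h_{1,j}, ..., h_{k,j}): k random sign bits packed into a key.
    """
    def __init__(self, dim, k=12, L=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.standard_normal((k, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, planes, x):
        bits = (planes @ x >= 0).astype(np.uint8)
        return bits.tobytes()                      # hashable k-bit code

    def add(self, idx, x):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, x)].append(idx)

    def query(self, x):
        """Return the set of candidate indices colliding with x in any table."""
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, x), []))
        return candidates
```

The candidates returned by query() are then scored with the true distance, exactly as in step ii) above; only colliding examples are scanned, never the whole database.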
LSH
• What questions can we answer with this strategy, for a chosen probability of failure δ?
• Randomized c-approximate NN: L in O(n^ρ), where ρ = ln(1/P1)/ln(1/P2)
• If P1 > P2, then ρ < 1, so each search runs in sub-linear time!! (a worked example follows)
• Randomized NN (R-near neighbour): choose L = ⌈log δ / log(1 − P1^k)⌉
• Choice of parameters: a larger code length k means less populated buckets, since the collision-probability gap increases; but, at the same time, it means we need a bigger number of tables L to guarantee the chosen failure probability.
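A worked illustration with hypothetical values P1 = 0.9 and P2 = 0.5 (numbers chosen for the example, not from the talk):

```latex
\rho = \frac{\ln(1/P_1)}{\ln(1/P_2)}
     = \frac{\ln(1/0.9)}{\ln(1/0.5)}
     \approx \frac{0.105}{0.693}
     \approx 0.15
```

So each query inspects on the order of n^{0.15} candidates instead of scanning all n items.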
Next steps…
• So, we can find an R-NN in sub-linear time, and
now what?
• Unfortunately, from this point on the theory is less
revealing, but practical results are good.
• What if I cannot encode my problem with one of
those metrics?
Semantic Hashing
• It uses the innermost representation (the code layer) of a deep autoencoder as a hash function
• Training process
• First we train a set of stacked RBMs in a layer-wise manner
[Diagram: a stack of three RBMs, trained greedily one layer at a time.]
Semantic Hashing
• Training process
• First we train a set of stacked RBMs in a layer-wise manner
• Then we fine-tune an unrolled version of the original network
[Diagram: the unrolled deep autoencoder with an N-bit code layer in the middle.]
Semantic Hashing
• Search process (a sketch follows below)
• Build a hash table by placing each element in the bucket given by its code
• Retrieve the elements inside the Hamming ball of radius n around the query code
[Diagram: lookup over an N-bit code.]
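A sketch of that search step, assuming binary codes have already been produced by the trained encoder and are stored as integers; the Hamming-ball radius is a free parameter:

```python
from collections import defaultdict
from itertools import combinations

def build_code_table(codes):
    """codes: dict item_id -> n-bit code stored as an int. Bucket items by exact code."""
    table = defaultdict(list)
    for item_id, code in codes.items():
        table[code].append(item_id)
    return table

def hamming_ball_lookup(table, query_code, n_bits, radius=2):
    """Return all items whose code is within `radius` bit flips of the query code."""
    results = list(table.get(query_code, []))
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = query_code
            for pos in positions:
                flipped ^= 1 << pos            # flip one bit
            results.extend(table.get(flipped, []))
    return results
```

For short codes and small radii the number of probed buckets stays tiny, which is what makes the lookup fast.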
Applications: Clustering
• Correlation clustering!
• Allows us to find groups without specifying the number of them (or their shape) a priori
[Figure: a point cloud whose pairwise edges are labelled + (similar) or − (dissimilar).]
Applications: Clustering
• Correlation clustering!
• Implementations
• Pivot algorithm: 3-approximation
• Parallel version: requires log²(n) iterations
[Figure: the same point cloud with + and − edges, partitioned around the chosen pivots.]
CC-Pivot(G):
  Pick a random pivot i ∈ V
  Set C = {i}, V′ = Ø
  For all j ∈ V, j ≠ i:
    If (i, j) ∈ E+ then
      Add j to C
    Else (if (i, j) ∈ E−)
      Add j to V′
  Let G′ be the subgraph induced by V′
  Return clustering C, CC-Pivot(G′)
• While the instance is non-empty:
1. Let A be its current maximum positive degree
2. Activate each element independently with probability ε/A
3. Deactivate all the active elements that are connected through a positive edge to other active elements
4. The remaining active nodes become pivots
5. Create one cluster for each pivot (breaking ties randomly)
(A code sketch of the sequential pivot algorithm follows below.)
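A compact sketch of the sequential CC-Pivot algorithm (assuming the graph is given as a dict of positive-neighbour sets with sortable node ids; any pair not joined by a + edge is treated as negative):

```python
import random

def cc_pivot(nodes, positive_neighbours, seed=0):
    """Randomized CC-Pivot: expected 3-approximation for correlation clustering.

    nodes: set of node ids.
    positive_neighbours: dict node -> set of nodes joined to it by a + edge.
    Returns a list of clusters (sets of nodes).
    """
    rng = random.Random(seed)
    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))                    # pick a random pivot i
        neighbours = positive_neighbours.get(pivot, set())
        cluster = {pivot} | (neighbours & remaining)             # C = {i} plus its + neighbours
        clusters.append(cluster)
        remaining -= cluster                                     # recurse on the rest, iteratively
    return clusters
```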
Applications: Anomaly
detection
• Local outlier factor (LOF), sketched below
• An anomaly is a point that has an abnormally low density when compared with the points most similar to it
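A minimal sketch using scikit-learn's LocalOutlierFactor, which scores each point against the density of its k nearest neighbours (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
X[:5] += 6.0                                # a handful of planted outliers

lof = LocalOutlierFactor(n_neighbors=20)    # density relative to the 20-NN graph
labels = lof.fit_predict(X)                 # -1 = anomaly, 1 = inlier
scores = -lof.negative_outlier_factor_      # larger score = more anomalous
print(np.where(labels == -1)[0])
```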
Applications: Classification
• The old, intuitive kNN classifier, applied directly on the k-NN graph
• Semi-supervised learning: propagate the few known labels over the graph (a sketch follows below)
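A sketch of the semi-supervised case with scikit-learn's LabelSpreading, which propagates a handful of known labels over a kNN graph (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
y_partial = np.full_like(y, -1)        # -1 marks "unlabelled"
y_partial[::30] = y[::30]              # keep only a handful of true labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print((model.transduction_ == y).mean())   # fraction of points labelled correctly
```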
Applications: Inference
Idea: by minimising total variation with respect to the most connected neighbours, we can infer the geolocation of Twitter users.
See: Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization. IEEE 2014 Conference on Big Data.
[Excerpts from the cited paper, cleaned up from the pasted page:]

Figure 2 studies contact patterns between users who reveal their location via GPS, comparing the bidirectional @mention network, the bidirectional network after filtering edges for triadic closures, and the complete unidirectional @mention network. Although GPS-known users make up only a tiny portion of the @mention networks, online social ties typically form between users who live near each other, and a majority of GPS-known users have at least one GPS-known friend within 10 km. Figure 6 shows that, despite the high number of inactive users, the bulk of tweets is generated by active Twitter users, underlining the importance of geotagging active accounts; Figure 7 shows the histogram of errors under different restrictions on the maximum allowable geographic dispersion of each user's friends.

The optimization models proximity of connected users. The total variation functional is non-differentiable, so finding a global minimum is a formidable challenge; the paper employs "parallel coordinate descent". Most variants of coordinate descent cycle through the domain sequentially, but the scale of the data necessitates a parallel approach: at each iteration, every user's location is simultaneously updated to the l1-multivariate median of their friends' locations, and only after all updates are complete are the results communicated over the network. At iteration k, with user estimates f^k, the variation on the i-th node is

  ∇_i(f^k, f) = Σ_j w_ij · d(f, f^k_j)        (6)

Algorithm 1: Parallel coordinate descent for constrained TV minimization
  Initialize: f_i = l_i for i ∈ L
  for k = 1…N do
    parfor i:
      if i ∈ L then f^{k+1}_i = l_i
      else f^{k+1}_i = argmin_f ∇_i(f^k, f)
    end
    f^k = f^{k+1}
  end

The argument that minimizes (6) is the l1-multivariate median of the locations of the neighbours of node i; placing this computation inside the parfor of Algorithm 1 reproduces the Spatial Label Propagation algorithm of [12] as a coordinate descent method designed to minimize total variation.

Individual error estimation: the vast majority of Twitter users @mention geographically close users, but some have amassed friends dispersed around the globe, and for these users the approach should not be used to infer location. The error estimate for user i is the median absolute deviation of the inferred locations of user i's friends, computed via (3). With a dispersion restriction as an additional parameter, the optimization becomes

  min_f ∇f  subject to  f_i = l_i for i ∈ L  and  max_i ∇̃f_i bounded by the dispersion parameter        (7)
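A toy sketch of that update rule (not the paper's code): each unlabelled node moves to a robust centre of its neighbours' current estimates; the component-wise median stands in here for the l1-multivariate median, and the graph is a plain adjacency dict:

```python
import numpy as np

def infer_locations(adjacency, known, n_iters=10):
    """Spatial label propagation sketch.

    adjacency: dict node -> list of neighbour nodes.
    known: dict node -> np.array([lat, lon]) for nodes with ground-truth locations.
    Returns a dict node -> estimated location.
    """
    estimates = dict(known)                   # start from the known locations
    for _ in range(n_iters):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in known:
                continue                      # labelled nodes stay fixed
            points = [estimates[n] for n in neighbours if n in estimates]
            if points:
                # Component-wise median as a cheap stand-in for the l1-multivariate median.
                updates[node] = np.median(np.stack(points), axis=0)
        estimates.update(updates)             # "communicate" all updates at once
    return estimates
```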
Applications (many more)
Blossom algorithm
Shortest paths
Search Algorithms
Bipartite Minimum Cut
Wrap up!
Raw data → Extract features → Locality hashing → Navigate the graph: anomaly detection, clustering, inference
Building graphs to
discover information
Dr. David Martinez Rego
Big Data Spain 2015
