Building graphs to
discover information
Dr. David Martinez Rego
Big Data Spain 2015
Data science?
• Every once in a while you hear the same question in the office, in discussions, …
• But… what is a data scientist?
• The response is usually vague, but my definition (from the ML point of view) is:
• Do whatever you can to transform raw data into information that carries some business value
Acme project: Day 1
• After the first handshake, reality!
• The team is usually handed data that has not been prepared for any learnable task
• The aim (business value, BV) is unclear or not present at all
• Many books discuss design strategies
• Context, need, vision, outcome (Max Shron)
Labels?
Prior knowledge?
Exploration?
Scale?
?
Let's use a graph!
What can I do?
• Find structure in the information
• Connected components
• Hubs
• Infer information (more on this later…)
• Clustering
• Classification
• Anomaly detection
• Many more…
Anywhere?
• Scalable graph algorithms are behind many of the biggest recent developments
• PageRank, social networks, medical research, DevOps, …
• So, is there an extra mile?
• All these cases have something in common: they take the graph for granted
• it is already given by the problem
• it is highly sparse
• it carries business value
Anywhere?
• What about the case where the graph is not explicit?
• That case takes more work, since we have to figure out
• how to encode individuals in a way that the graph carries the information we want
• how to build the graph itself, which is a challenging problem!
Anywhere?
• Naïve algorithm
• for i = 1..N
•     for j = 1..N
•         M[i,j] = sim(d[i], d[j])
• prune(graph)
If we have around 1 million entities, that is N² = 10^12 similarity computations: way bigger than the whole set of tweets in a year.
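A minimal Python sketch of the naïve construction may make the cost concrete. The cosine similarity, the toy vectors and the name `naive_knn_graph` are assumptions made for illustration; the slides do not fix a particular `sim` or pruning rule, and here pruning keeps the k most similar items per node.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def naive_knn_graph(d, k, sim=cosine_sim):
    """Brute-force k-nn graph: every item is compared with every other item."""
    n = len(d)
    graph = {}
    for i in range(n):
        scored = [(sim(d[i], d[j]), j) for j in range(n) if j != i]
        scored.sort(reverse=True)              # prune: keep the k most similar
        graph[i] = [j for _, j in scored[:k]]
    return graph

# Toy data (hypothetical): two pairs of nearly parallel vectors.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
graph = naive_knn_graph(points, k=1)

# Size of the full matrix M for one million entities.
n_entities = 1_000_000
pairs = n_entities * n_entities
```

The double loop is what the hashing techniques in the following slides try to avoid.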
Wiser options
• So we need techniques that let us compute the k-nn graph without computing the whole adjacency matrix
• It is not possible to do it exactly, but it is possible within an error margin
• Locality-Sensitive Hashing (LSH): for some specific metrics such as Euclidean, Hamming, L1, and some edit distances through embeddings.
• Semantic Hashing: for when the notion of metric/similarity is not clear; it can work very well, although it comes with no theoretical guarantees.
• Main idea of both: create a hash function such that similar items collide with high probability and collisions between dissimilar items are unlikely.
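As a small illustration of that main idea, the sketch below uses bit sampling for the Hamming metric: the hash of a binary string is one coordinate chosen uniformly at random. The choice of family and the example strings are assumptions made here, not something the slides prescribe.

```python
def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def collision_probability(x, y):
    """Exact P[h(x) == h(y)] when h reads one uniformly random coordinate."""
    agree = sum(a == b for a, b in zip(x, y))
    return agree / len(x)

a = "1101011010"
b = "1101011011"   # differs from a in one position
c = "0010100101"   # differs from a in every position

near = collision_probability(a, b)   # similar items collide often
far = collision_probability(a, c)    # dissimilar items rarely (here never) collide
```

For this family the collision probability is simply 1 minus the Hamming distance divided by the length, so it decreases as items get further apart.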
LSH
Locality-Sensitive Hashing
[Figure: a database of n items X_i is hashed by a series of b randomized LSH functions h_{r1…rb} into a hash table; a query Q is hashed with the same functions and only the colliding instances (<< n) are searched. Similar instances collide, w.h.p.]
Fig. 5 Locality Sensitive Hashing (LSH) uses hash keys constructed so as to guarantee collision is more likely for more similar examples [33, 23]. Once all database items have been hashed into the table(s), the same randomized functions are applied to novel queries. One exhaustively searches only those examples with which the query collides.
Kristen Grauman & Rob Fergus
LSH
• LSH relies on the existence of an LSH family of functions for a given metric.
• A family H is (R, cR, P1, P2)-sensitive if for any two points p and q
• if |p-q| < R, then P[h(p) == h(q)] > P1
• if |p-q| > cR, then P[h(p) == h(q)] < P2
where h is selected independently at random from the family H and P1 > P2
LSH
• The effect emerges from a basic probability phenomenon
• For a hash code of length m, two close points collide with probability p1^m
• On the other hand, two far-apart points collide with probability p2^m
• If p1 > p2, the gap between the two (the ratio (p1/p2)^m) grows even with moderate code sizes
• Unfortunately, when designing the LSH we cannot always achieve a high p1
• Build several tables with the same strategy: with L independent tables, a close pair collides in at least one of them with probability 1 - (1 - p1^m)^L, so the probability of finding an approximate nearest neighbour increases.
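The amplification effect can be checked numerically. The values of `p1` and `p2` below are hypothetical, picked only to show the trend; the helper names are likewise made up for this sketch.

```python
def collision_prob(p, m):
    """Probability that all m independent hash bits agree."""
    return p ** m

def hit_prob(p, m, tables):
    """Probability of colliding in at least one of `tables` independent tables."""
    return 1.0 - (1.0 - p ** m) ** tables

p1, p2 = 0.9, 0.5   # hypothetical per-bit collision probabilities (close / far)

for m in (1, 8, 16):
    close = collision_prob(p1, m)
    far = collision_prob(p2, m)
    # The ratio close / far grows with m, but close itself shrinks,
    # which is why several tables are used to recover recall.
    print(m, close, far, close / far, hit_prob(p1, m, tables=10))
```

Longer codes separate close and far points better, at the price of a lower chance that a close pair shares a bucket in any single table.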
LSH
[…] input set into a bucket g_j(p), for j = 1,…,L. Since the total number of buckets may be large, we retain only the nonempty buckets by resorting to (standard) hashing³ of the values g_j(p). In this way, the data structure uses only O(nL) memory cells; note that it suffices that the buckets store the pointers to data points, not the points themselves.
To process a query q, we scan through the buckets g_1(q),…, g_L(q), and retrieve the points stored in them. After retrieving the points, we com[…]
³ See [16] for more details on hashing.
[…] L = log_{1 - P1^k} δ, so that (1 - P1^k)^L ≤ δ, then any R-neighbor of q is returned by the algorithm with probability at least 1 - δ.
How should the parameter k be chosen? Intuitively, larger values of k lead to a larger gap between the probabilities of collision for close points and far points; the probabilities are P1^k and P2^k, respectively (see Figure 3 for an illustration). The benefit of this amplification is that the hash functions are more selective. At the same time, if k is large then P1^k is small, which means that L must be sufficiently large to ensure that an R-near neighbor collides with the query point at least once.
Preprocessing:
1. Choose L functions g_j, j = 1,…,L, by setting g_j = (h_{1,j}, h_{2,j},…, h_{k,j}), where h_{1,j},…, h_{k,j} are chosen at random from the LSH family H.
2. Construct L hash tables, where, for each j = 1,…,L, the jth hash table contains the dataset points hashed using the function g_j.
Query algorithm for a query point q:
1. For each j = 1, 2,…, L
   i) Retrieve the points from the bucket g_j(q) in the jth hash table.
   ii) For each of the retrieved points, compute the distance from q to it, and report the point if it is a correct answer (cR-near neighbor for Strategy 1, and R-near neighbor for Strategy 2).
   iii) (optional) Stop as soon as the number of reported points is more than L′.
Fig. 2. Preprocessing and query algorithms of the basic LSH algorithm.
Communications of the ACM, January 2008, Vol. 51, No. 1, p. 119
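The preprocessing and query steps of Fig. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LSH family is bit sampling over binary strings (Hamming distance), the class name `BitSamplingLSH` is made up, and the query reports every colliding point within a given radius (the R-near neighbour variant) without the optional early stop.

```python
import random
from collections import defaultdict

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

class BitSamplingLSH:
    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # g_j = (h_1j, ..., h_kj); each h reads one randomly chosen coordinate.
        self.g = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, j, p):
        return tuple(p[i] for i in self.g[j])

    def add(self, p):
        idx = len(self.points)
        self.points.append(p)
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].append(idx)   # buckets store pointers only

    def query(self, q, radius):
        found = set()
        for j, table in enumerate(self.tables):
            for idx in table.get(self._key(j, q), []):
                # Verify each candidate with the true distance.
                if hamming(self.points[idx], q) <= radius:
                    found.add(idx)
        return sorted(found)

data = ["00000000", "00000001", "11111111"]
index = BitSamplingLSH(dim=8, k=3, L=4)
for p in data:
    index.add(p)
hits = index.query("00000000", radius=1)
```

An identical point always collides in every table; a near point is found only with some probability, which is what the parameters k and L control.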
LSH
• What questions can we answer with this strategy, for a chosen probability of failure delta?
• Randomized c-approximate NN: L in O(n^r), where r = ln(1/P1)/ln(1/P2)
• If P1 > P2, then r < 1, so each search takes sub-linear time!
• Randomized NN: choose L = log_{1-P1^k}(delta)
• Choice of parameters: a longer code means less populated buckets, since the gap increases, but at the same time it means that we need a larger number of tables L to keep the failure probability below delta.
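A short sketch of the parameter trade-off, with hypothetical values for P1, P2 and delta. The function names are made up here; `tables_needed` simply rounds L = log_{1-P1^k}(delta) up to an integer.

```python
import math

def rho(p1, p2):
    """Exponent r = ln(1/P1) / ln(1/P2); below 1 whenever P1 > P2."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def tables_needed(p1, k, delta):
    """Smallest integer L such that (1 - P1^k)^L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

p1, p2, delta = 0.9, 0.5, 0.05          # hypothetical values
r = rho(p1, p2)
L_small_k = tables_needed(p1, k=8, delta=delta)
L_large_k = tables_needed(p1, k=16, delta=delta)
```

Doubling the code length k makes buckets sparser but raises the number of tables required for the same failure probability.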
Next steps…
• So, we can find an R-NN in sub-linear time, and
now what?
• Unfortunately, from this point on the theory is less
revealing, but practical results are good.
• What if I cannot encode my problem with one of
those metrics?
Semantic Hashing
• It uses the innermost representation (the code layer) of an autoencoder as a hash function
• Training process
• First we train a set of stacked RBMs in a layer-wise manner
• Then we fine-tune an unrolled version of the original network
[Figure: three stacked RBMs; the unrolled network has an N-bit code layer in the middle]
Semantic Hashing
• Search process
• Build a hash table by placing each element in the bucket given by its N-bit code
• Retrieve the elements inside the Hamming ball of radius n around the query's code
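The search step can be sketched independently of how the codes are learned. In the snippet below the 4-bit codes are hypothetical stand-ins for the output of a trained encoder, and the function names are made up for illustration.

```python
from itertools import combinations
from collections import defaultdict

def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide integer code within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for pos in positions:
                flipped ^= 1 << pos
            yield flipped

def build_table(codes):
    """Bucket each item under its binary code."""
    table = defaultdict(list)
    for item, code in codes.items():
        table[code].append(item)
    return table

def search(table, query_code, n_bits, radius):
    """Collect the items stored in every bucket of the Hamming ball."""
    hits = []
    for code in hamming_ball(query_code, n_bits, radius):
        hits.extend(table.get(code, []))
    return hits

# Hypothetical codes, as if read from the code layer of the autoencoder.
codes = {"doc_a": 0b1010, "doc_b": 0b1011, "doc_c": 0b0101}
table = build_table(codes)
neighbours = search(table, 0b1010, n_bits=4, radius=1)
```

The number of buckets probed grows combinatorially with the radius, so in practice the radius is kept small.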
Applications: Clustering
• Correlation clustering
• Allows us to find groups without specifying a priori their number (or their shape)
[Figure: graph with edges labelled + and −]
Applications: Clustering
• Correlation clustering
• Implementations
• Pivot algorithm: 3-approximation
• Parallel version: requires log²(n) iterations
[Figure: graph with edges labelled + and −]
Pick random pivot i ∈ V
Set C = {i}, V' = Ø
For all j ∈ V, j ≠ i:
    If (i,j) ∈ E+ then
        Add j to C
    Else (if (i,j) ∈ E−)
        Add j to V'
Let G' be the subgraph induced by V'
Return clustering C, CC-Pivot(G')
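A possible Python rendering of CC-Pivot, written iteratively instead of recursively. The edge representation (a set of `frozenset` pairs for E+, with every other pair treated as E−) and the toy graph are assumptions of this sketch.

```python
import random

def cc_pivot(nodes, positive, rng=None):
    """positive: set of frozenset({u, v}) edges labelled '+'; other pairs are '-'."""
    rng = rng or random.Random(0)
    remaining = list(nodes)
    clusters = []
    while remaining:                        # each pass handles one subgraph G'
        pivot = rng.choice(remaining)
        cluster = [pivot]                   # C = {i}
        rest = []                           # V'
        for j in remaining:
            if j == pivot:
                continue
            if frozenset((pivot, j)) in positive:
                cluster.append(j)
            else:
                rest.append(j)
        clusters.append(cluster)
        remaining = rest
    return clusters

# Toy graph: two groups with only positive edges inside each group.
nodes = ["a", "b", "c", "d", "e"]
positive = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]}
clusters = cc_pivot(nodes, positive)
```

On this toy input every pivot choice recovers the same two groups; on noisy graphs the result depends on the random pivots.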
• While the instance is non-empty
1. Let A be its current maximum positive degree
2. Activate each element independently with probability ε/A
3. Deactivate all the active elements that are connected through a positive edge to other active elements
4. The remaining active nodes become pivots
5. Create one cluster for each pivot (breaking ties randomly)
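The rounds above can be simulated sequentially to see how pivots are chosen. This is a sketch under assumptions, not a distributed implementation: the activation constant is called `eps` and set to 0.5, the graph is a dict of positive-neighbour sets, and nodes left without any positive edge are emitted as singletons.

```python
import random

def parallel_pivot(adj, eps=0.5, seed=0):
    """adj: node -> set of positive neighbours (symmetric, no self loops)."""
    rng = random.Random(seed)
    alive = set(adj)
    clusters = []
    while alive:
        max_deg = max(len(adj[v] & alive) for v in alive)
        if max_deg == 0:
            clusters.extend([v] for v in sorted(alive))   # only singletons remain
            break
        # Step 2: activate each element independently.
        active = {v for v in sorted(alive) if rng.random() < eps / max_deg}
        # Steps 3-4: active elements with an active positive neighbour drop out.
        pivots = {v for v in active if not (adj[v] & active)}
        # Step 5: every element adjacent to a pivot joins one, ties broken randomly.
        owner = {p: p for p in pivots}
        for v in sorted(alive - pivots):
            candidates = sorted(adj[v] & pivots)
            if candidates:
                owner[v] = rng.choice(candidates)
        grouped = {}
        for v, p in owner.items():
            grouped.setdefault(p, []).append(v)
        clusters.extend(grouped.values())
        alive -= set(owner)
    return clusters

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"e"}, "e": {"d"}}
clusters = parallel_pivot(adj)
```

Because pivots are never adjacent, all pivots of a round can build their clusters at the same time, which is what makes the scheme parallel.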
Applications: Anomaly
detection
• Local outlier factor (LOF)
• An anomaly is a point whose density is abnormally low compared with that of the points most similar to it
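A compact, pure-Python sketch of the LOF score on top of a brute-force k-nn search. It simplifies the usual definition by using exactly k neighbours per point (ignoring distance ties) and assumes no duplicate points; the toy data are hypothetical.

```python
import math

def lof_scores(points, k):
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    knn, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn.append(order[:k])                  # k nearest neighbours of i
        kdist.append(dist[i][order[k - 1]])    # distance to the k-th neighbour
    lrd = []
    for i in range(n):
        # Reachability distance from i to each neighbour j.
        reach = [max(kdist[j], dist[i][j]) for j in knn[i]]
        lrd.append(len(reach) / sum(reach))    # local reachability density
    # LOF: average neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

# A 3x3 grid plus one far-away point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
outlier = scores.index(max(scores))
```

Scores close to 1 mean a point is about as dense as its neighbours; values well above 1 flag candidates for anomalies.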
Applications: Classification
• Old intuitive kNN classifier
• Semi-supervised learning
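Once the k-nn graph exists, the kNN classifier is a majority vote over a node's labelled neighbours. The graph, the labels and the function name below are hypothetical.

```python
from collections import Counter

def knn_classify(neighbours, labels, node):
    """Majority vote among the labelled neighbours of `node` in the k-nn graph."""
    votes = Counter(labels[j] for j in neighbours[node] if j in labels)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical k-nn graph (k = 3) and a partially labelled node set.
neighbours = {"x": ["a", "b", "c"], "y": ["c", "d", "e"], "z": ["q"]}
labels = {"a": "spam", "b": "spam", "c": "ham", "d": "ham"}

pred_x = knn_classify(neighbours, labels, "x")
pred_y = knn_classify(neighbours, labels, "y")
pred_z = knn_classify(neighbours, labels, "z")   # no labelled neighbour
```

Unlabelled neighbours are simply skipped, which is the starting point for semi-supervised schemes that propagate labels over the same graph.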
Applications: Inference
Idea: by minimising total variation with respect to the most connected neighbours, we can infer geolocation for Twitter users.
See: Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization. IEEE 2014 Conference on Big Data
Figure 6: Histogram of tweets as a function of activity level. For each group of users described in fig. 4 and fig. 5 we collected the total number of tweets generated by the group. Despite the high number of inactive users, the bulk of tweets are generated by active Twitter users, indicating the importance of geotagging active accounts.
Figure 7: Histogram of errors with different restrictions on the maximum allowable geographic dispersion of each user's […]
(a) CDF of the geographic distance between friends
(b) CDF of the geographic distance between a user and their geographically closest friend
Figure 2: Study of contact patterns between users who reveal their location via GPS. Subgraphs of GPS users are taken from the bidirectional @mention network (blue), the bidirectional @mention network after filtering edges for triadic closures (green), and the complete unidirectional @mention network (black). In (a), we see that the distances spanned by reciprocated @mentions (blue and green) are smaller than those spanned by any @mention (black). In (b), we see that users often have at least one online social tie with a geographically nearby user. The subgraph sizes are: 19,515,278 edges and 3,972,321 nodes (green), 20,576,189 edges and 4,488,759 nodes (blue), 100,126,247 edges and 5,648,220 nodes (black). We suspect these results would be even stronger if more GPS data were available.
[…] well-aligned with geographic distance, we restrict our attention to GPS-known users and study contact patterns between them in fig. 2.
Users with GPS-known locations make up only a tiny portion of our @mention networks. Despite the relatively small amount of data, we can still see in fig. 2 that online social ties typically form between users who live near each other and that a majority of GPS-known users have at least one GPS-known friend within 10km.
The optimization (1) models proximity of connected users. Unfortunately, the total variation functional is nondifferentiable and finding a global minimum is thus a formidable challenge. We will employ "parallel coordinate descent" [25] to solve (1). Most variants of coordinate descent cycle through the domain sequentially, updating each variable and communicating back the result before the next variable can update. The scale of our data necessitates a parallel approach, prohibiting us from making all the communication steps required by a traditional coordinate descent method.
At each iteration, our algorithm simultaneously updates each user's location with the l1-multivariate median of their friends' locations. Only after all updates are complete do we communicate our results over the network.
At iteration k, denote the user estimates by f^k and the variation on the ith node by
\nabla_i(f^k, f) = \sum_j w_{ij} \, d(f, f^k_j)    (6)
Parallel coordinate descent can now be stated concisely in alg. 1.
The argument that minimizes (6) is the l1-multivariate median of the locations of the neighbours of node i. By placing this computation inside the parfor of alg. 1, we have reproduced the Spatial Label Propagation algorithm of [12] as a coordinate descent method designed to minimize total variation.
Algorithm 1: Parallel coordinate descent for constrained TV minimization
Initialize: f_i = l_i for i ∈ L
for k = 1...N do
    parfor i:
        if i ∈ L then
            f_i^{k+1} = l_i
        else
            f_i^{k+1} = argmin_f \nabla_i(f^k, f)
        end
    end
    f^k = f^{k+1}
end
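The update loop of Algorithm 1 can be sketched in Python with two loud simplifications that are assumptions of this example, not the paper's method: locations are planar (x, y) points with Euclidean distance, and the minimiser of (6) is searched only among the friends' current locations (a medoid) rather than computed as the exact l1-multivariate median. Names and data are hypothetical.

```python
import math

def spatial_label_propagation(neighbours, known, iterations):
    """neighbours: user -> list of (friend, weight); known: user -> (x, y)."""
    est = dict(known)                        # f_i = l_i for i in L
    for _ in range(iterations):
        nxt = dict(known)                    # users in L keep their location
        for user, friends in neighbours.items():
            if user in known:
                continue
            # Every update reads only the previous iterate `est` (the parfor).
            located = [(est[f], w) for f, w in friends if f in est]
            if not located:
                continue                     # no located friend yet
            # Assumption: candidate minimisers are the friends' own locations.
            nxt[user] = min(
                (loc for loc, _ in located),
                key=lambda c: sum(w * math.dist(c, loc) for loc, w in located),
            )
        est = nxt                            # f^k = f^{k+1}
    return est

known = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (2.0, 0.0), "D": (100.0, 100.0)}
neighbours = {
    "X": [("A", 1.0), ("B", 1.0), ("C", 1.0), ("D", 1.0)],
    "Y": [("X", 1.0)],
}
est = spatial_label_propagation(neighbours, known, iterations=2)
```

The single far-away friend D does not drag X's estimate away from the cluster of nearby friends, which is the robustness a median-type update is meant to provide; Y is located one iteration later, through X.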
3.4 Individual Error Estimation
The vast majority of Twitter users @mention with geographically close users. However, there do exist several users who have amassed friends dispersed around the globe. For these users, our approach should not be used to infer location.
We use a robust estimate of the dispersion of each user's friend locations to infer accuracy of our geocoding algorithm. Our estimate for the error on user i is the median absolute deviation of the inferred locations of user i's friends, computed via (3). With a dispersion restriction as an additional parameter, γ, our optimization becomes
\min_f \nabla f \quad \text{subject to} \quad f_i = l_i \ \text{for } i \in L \ \text{ and } \ \max_i \widetilde{\nabla f_i} < \gamma    (7)
Applications (many more)
Blossom algorithm
Shortest paths
Search Algorithms
Bipartite Minimum Cut
Wrap up!
Raw data → Extract features → Locality hashing → Navigate
• Anomaly detection
• Clustering
• Inference
More Related Content

What's hot

Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
A Comparison of Serial and Parallel Substring Matching Algorithms
A Comparison of Serial and Parallel Substring Matching AlgorithmsA Comparison of Serial and Parallel Substring Matching Algorithms
A Comparison of Serial and Parallel Substring Matching Algorithmszexin wan
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...DECK36
 
Enhancing security in cloud storage
Enhancing security in cloud storageEnhancing security in cloud storage
Enhancing security in cloud storageShivam Singh
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
Computing probabilistic queries in the presence of uncertainty via probabilis...
Computing probabilistic queries in the presence of uncertainty via probabilis...Computing probabilistic queries in the presence of uncertainty via probabilis...
Computing probabilistic queries in the presence of uncertainty via probabilis...Konstantinos Giannakis
 
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...Association for Computational Linguistics
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processingjins0618
 

What's hot (12)

Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Hashing
HashingHashing
Hashing
 
A Comparison of Serial and Parallel Substring Matching Algorithms
A Comparison of Serial and Parallel Substring Matching AlgorithmsA Comparison of Serial and Parallel Substring Matching Algorithms
A Comparison of Serial and Parallel Substring Matching Algorithms
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
poster
posterposter
poster
 
Enhancing security in cloud storage
Enhancing security in cloud storageEnhancing security in cloud storage
Enhancing security in cloud storage
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
Computing probabilistic queries in the presence of uncertainty via probabilis...
Computing probabilistic queries in the presence of uncertainty via probabilis...Computing probabilistic queries in the presence of uncertainty via probabilis...
Computing probabilistic queries in the presence of uncertainty via probabilis...
 
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...
Abigail See - 2017 - Get To The Point: Summarization with Pointer-Generator N...
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
 

Viewers also liked

Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Big Data Spain
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Big Data Spain
 
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Big Data Spain
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data Spain
 
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Big Data Spain
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
 
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...Big Data Spain
 
Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Big Data Spain
 
A new streaming computation engine for real-time analytics by Michael Barton ...
A new streaming computation engine for real-time analytics by Michael Barton ...A new streaming computation engine for real-time analytics by Michael Barton ...
A new streaming computation engine for real-time analytics by Michael Barton ...Big Data Spain
 
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...Big Data Spain
 
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Big Data Spain
 
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...Big Data Spain
 
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data Spain
 

Viewers also liked (14)

Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
 
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
 
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
 
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
 
Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...
 
A new streaming computation engine for real-time analytics by Michael Barton ...
A new streaming computation engine for real-time analytics by Michael Barton ...A new streaming computation engine for real-time analytics by Michael Barton ...
A new streaming computation engine for real-time analytics by Michael Barton ...
 
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
 
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
 
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
 
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
 

Similar to Building graphs to discover information by David Martínez at Big Data Spain 2015

Local sensitive hashing &amp; minhash on facebook friend
Local sensitive hashing &amp; minhash on facebook friendLocal sensitive hashing &amp; minhash on facebook friend
Local sensitive hashing &amp; minhash on facebook friendChengeng Ma
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series BigML, Inc
 
Sketch algorithms
Sketch algorithmsSketch algorithms
Sketch algorithmsSimon Belak
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
 
Big Data Conference
Big Data ConferenceBig Data Conference
Big Data ConferenceDataTactics
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationRich Heimann
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkSpark Summit
 
ScaleGraph - A High-Performance Library for Billion-Scale Graph Analytics
ScaleGraph - A High-Performance Library for Billion-Scale Graph AnalyticsScaleGraph - A High-Performance Library for Billion-Scale Graph Analytics
ScaleGraph - A High-Performance Library for Billion-Scale Graph AnalyticsToyotaro Suzumura
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
CPSC 125 Ch 4 Sec 5
CPSC 125 Ch 4 Sec 5CPSC 125 Ch 4 Sec 5
CPSC 125 Ch 4 Sec 5David Wood
 
Data structure-questions
Data structure-questionsData structure-questions
Data structure-questionsShekhar Chander
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Gingles Caroline
 
Probabilistic programming
Probabilistic programmingProbabilistic programming
Probabilistic programmingEli Gottlieb
 
PyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at ScalePyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at ScaleGoDataDriven
 
Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
 
large_scale_search.pdf
large_scale_search.pdflarge_scale_search.pdf
large_scale_search.pdfEmerald72
 

Building graphs to discover information by David Martínez at Big Data Spain 2015

  • 1.
  • 2. Building graphs to discover information Dr. David Martinez Rego Big Data Spain 2015
  • 3. Data science? • Every once in a while you hear the same question in the office, in discussions, … • But... what is a data scientist? • Of course the answer is usually vague, but here is my definition (from the ML point of view): • Do whatever you can to transform raw data into information that carries some business value
  • 4. Acme project: Day 1 • After the first handshake, reality! • The team is usually handed data that has not been prepared for any learnable task • The aim (business value, BV) is unclear or missing altogether • Many books talk about design strategies • Context, need, vision, outcome (Max Shron)
  • 5.
  • 6.
  • 8. ?
  • 9. Let's use a graph!
  • 10. Let's use a graph!
  • 11. What can I do? • Find structure in the information • Connected components • Hubs • Infer information (more on this later…) • Clustering • Classification • Anomaly detection • Many more…
  • 12. Anywhere? • Scalable graph algorithms are behind many of the biggest recent developments • PageRank, social networks, medical research, DevOps, … • So, is there an extra mile? • All these cases have something in common: they take the graph for granted, • it is already given by the problem • it is highly sparse • it carries business value
  • 13. Anywhere? • What about the case where the graph is not explicit? • That case carries more work, since we have to figure out • how to encode individuals in a way that the graph carries the information we want • how to build the graph itself, which is a challenging problem!!
  • 14. Anywhere? • Naïve algorithm! • for i=1..N • for j=1..N • M[i,j] = sim(d[i], d[j]) • prune(graph) • If we have around 1 million entities, this takes about 10^12 similarity computations, far more than the whole set of tweets in a year.
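The naïve algorithm on this slide can be sketched in a few lines. This is only an illustration: the function names, the use of Jaccard similarity as `sim`, and pruning each row to its `k` best neighbours are assumptions, not details given in the talk.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity between two sets (an assumed choice of sim)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def naive_knn_graph(items, k, sim=jaccard):
    """Compare every pair, then prune each row to its k most similar items."""
    n = len(items)
    graph = {}
    for i in range(n):
        # One full row of the similarity matrix: n - 1 similarity calls.
        scores = [(sim(items[i], items[j]), j) for j in range(n) if j != i]
        graph[i] = [j for _, j in heapq.nlargest(k, scores)]
    return graph

# Toy data: two obvious groups of "documents" represented as token sets.
docs = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"x", "y", "z"}]
graph = naive_knn_graph(docs, k=1)
```

The double loop performs N·(N−1) similarity evaluations, which is what makes this approach impractical at the scale mentioned on the slide.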
  • 15. Wiser options • So we need techniques that allow us to calculate the k-NN graph without having to calculate the whole adjacency matrix • It is not possible to do it exactly, but it is possible within an error margin • Locality-Sensitive Hashing (LSH): for some specific metrics such as Euclidean, Hamming, L1, and some edit distances through embeddings. • Semantic Hashing: when the notion of metric/similarity is not clear; it can work very well, although there are no theoretical guarantees. • Main idea of both: create a hash function such that similar items collide with high probability and dissimilar items are unlikely to collide.
  • 16. LSH • Figure (Kristen Grauman and Rob Fergus, Locality Sensitive Hashing): the n database items X_i are hashed with a series of b randomized LSH functions h_r1…rb into a hash table, where similar instances collide w.h.p.; a query Q is hashed with the same functions and only the colliding instances (<< n) are searched. • Fig. 5 Locality Sensitive Hashing (LSH) uses hash keys constructed so as to guarantee collision is more likely for more similar examples [33, 23]. Once all database items have been hashed into the table(s), the same randomized functions are applied to novel queries. One exhaustively searches only those examples with which the query collides. • […] counters some of these shortcomings, and allows a user to explicitly control the similarity search accuracy and search time tradeoff [23].
  • 17. LSH • LSH relies on the existence of an LSH family of functions for a given metric. • A family H is (R, cR, P1, P2)-sensitive if, for any two points p and q: • if |p-q| < R, then P[h(p) == h(q)] > P1 • if |p-q| > cR, then P[h(p) == h(q)] < P2 • where h is selected independently at random from the family H and P1 > P2
  • 18. LSH • The effect emerges from a basic probability phenomenon • For a hash code of length m, two close points collide with probability p1^m • On the other hand, two far-apart points collide with probability p2^m • If p1 > p2, the gap between the two increases with moderate code sizes • Unfortunately, when designing the LSH we cannot always achieve a high p1 • Build several tables with the same strategy so the probability of finding an approximate nearest neighbour increases by a union bound.
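A small numeric sketch of this effect may help. The values of `p1` and `p2` below are made-up examples, and the function names are illustrative; the formulas assume that the m hash functions of a code and the tables themselves are chosen independently.

```python
def collision_prob(p, m):
    """Probability that all m independent hash functions agree,
    given a per-function collision probability p."""
    return p ** m

def hit_prob(p, m, tables):
    """Probability of colliding in at least one of `tables` independent tables,
    each using a code of length m."""
    return 1.0 - (1.0 - p ** m) ** tables

p1, p2 = 0.9, 0.5  # assumed per-function collision probabilities (close / far)
for m in (1, 5, 10, 20):
    # The ratio p1^m / p2^m grows with m, while p1^m itself shrinks.
    print(m, collision_prob(p1, m), collision_prob(p2, m))

# Using several tables recovers recall for close points.
for tables in (1, 5, 10):
    print(tables, hit_prob(p1, 10, tables))
```

Longer codes separate close and far points better, but they also lower the chance that a close pair lands in the same bucket of a single table, which is why several tables are built.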
  • 19. LSH • (Excerpt from Communications of the ACM, January 2008, Vol. 51, No. 1, p. 119.) • […] input set into a bucket g_j(p), for j = 1,…,L. Since the total number of buckets may be large, we retain only the nonempty buckets by resorting to (standard) hashing of the values g_j(p) (see [16] for more details on hashing). In this way, the data structure uses only O(nL) memory cells; note that it suffices that the buckets store the pointers to data points, not the points themselves. To process a query q, we scan through the buckets g_1(q),…, g_L(q), and retrieve the points stored in them. After retrieving the points, we […] • […] log_{1 − P1^k} δ so that (1 − P1^k)^L ≤ δ, then any R-neighbor of q is returned by the algorithm with probability at least 1 − δ. • How should the parameter k be chosen? Intuitively, larger values of k lead to a larger gap between the probabilities of collision for close points and far points; the probabilities are P1^k and P2^k, respectively (see Figure 3 for an illustration). The benefit of this amplification is that the hash functions are more selective. At the same time, if k is large then P1^k is small, which means that L must be sufficiently large to ensure that an R-near neighbor collides with the query point at least once. • Fig. 2. Preprocessing and query algorithms of the basic LSH algorithm. • Preprocessing: 1. Choose L functions g_j, j = 1,…,L, by setting g_j = (h_{1,j}, h_{2,j},…, h_{k,j}), where h_{1,j},…, h_{k,j} are chosen at random from the LSH family H. 2. Construct L hash tables, where, for each j = 1,…,L, the j-th hash table contains the dataset points hashed using the function g_j. • Query algorithm for a query point q: 1. For each j = 1, 2,…,L: i) Retrieve the points from the bucket g_j(q) in the j-th hash table. ii) For each of the retrieved points, compute the distance from q to it, and report the point if it is a correct answer (cR-near neighbor for Strategy 1, and R-near neighbor for Strategy 2). iii) (optional) Stop as soon as the number of reported points is more than L′.
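The preprocessing and query steps above could be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes bit vectors under Hamming distance and uses bit sampling (each h picks one random coordinate) as the LSH family, and the class and parameter names are hypothetical.

```python
import random
from collections import defaultdict

class HammingLSH:
    """Basic LSH index over equal-length bit tuples using bit sampling."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # Each g_j concatenates k randomly chosen coordinates (the h functions).
        self.projections = [[rng.randrange(dim) for _ in range(k)]
                            for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, j, p):
        return tuple(p[i] for i in self.projections[j])

    def add(self, p):
        idx = len(self.points)
        self.points.append(p)
        for j, table in enumerate(self.tables):
            # Buckets store indices (pointers), not the points themselves.
            table[self._key(j, p)].append(idx)

    def query(self, q, radius):
        """Return indices of colliding points within Hamming distance `radius`."""
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(j, q), ()))
        # Only the colliding candidates are compared exactly against q.
        return sorted(i for i in candidates
                      if sum(a != b for a, b in zip(self.points[i], q)) <= radius)

index = HammingLSH(dim=8, k=3, L=4)
for point in [(0,) * 8, (1,) * 8, (0, 0, 0, 0, 0, 0, 0, 1)]:
    index.add(point)
result = index.query((0,) * 8, radius=1)
```

An identical point always collides with the query, a point that differs in every bit never does, and the near neighbour at distance 1 is found with a probability that depends on `k` and `L`.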
  • 20. LSH • What questions can we answer with this strategy, for a chosen probability of failure delta? • Randomized c-approximate NN: L in O(n^r), where r = ln(1/P1)/ln(1/P2) • If P1 > P2, then r < 1, so each search takes sub-linear time!! • Randomized NN: choose L = log_{1 − P1^k}(delta) • Choice of parameters: a larger code length means less populated buckets, since the gap increases, but at the same time it means that we need a larger number of tables L to guarantee the chosen failure probability.
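The two quantities on this slide are easy to evaluate numerically. The sketch below uses made-up values for P1, P2, k and delta, and the helper names are illustrative; the number of tables is rounded up to the next integer.

```python
import math

def rho(p1, p2):
    """Exponent r = ln(1/P1) / ln(1/P2); it is below 1 whenever P1 > P2."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def tables_needed(p1, k, delta):
    """Smallest integer L such that (1 - P1^k)^L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

r = rho(0.9, 0.5)                     # assumed P1 = 0.9, P2 = 0.5
L = tables_needed(0.9, k=10, delta=0.1)
print(r, L)
```

Increasing `k` makes `P1^k` smaller, so `tables_needed` grows: this is the trade-off between selective buckets and the number of tables described in the last bullet.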
  • 21. Next steps… • So, we can find an R-NN in sub-linear time, and now what? • Unfortunately, from this point on the theory is less revealing, but practical results are good. • What if I cannot encode my problem with one of those metrics?
  • 22. Semantic Hashing • It uses the innermost representation (the code layer) of an autoencoder as a hash function • Training process • First we train a set of stacked RBMs in a layer-wise manner [Figure: stack of three RBMs]
  • 23. Semantic Hashing • Training process • First we train a set of stacked RBMs in a layer-wise manner • Then we fine-tune an unrolled version of the original network [Figure: unrolled network with an N-bit code layer]
  • 24. Semantic Hashing • Search process • Build a hash table by placing each element in the bucket given by its code • Retrieve the elements whose codes lie inside the Hamming ball of radius n around the query's code [Figure: N-bit code]
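    The search process can be sketched as follows, assuming the N-bit codes have already been produced by a trained autoencoder (that part is not shown) and are stored as Python integers. The function names `build_table`, `hamming_ball` and `search` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_table(codes):
    """codes: dict item_id -> integer code. Returns bucket -> list of ids."""
    table = defaultdict(list)
    for item_id, code in codes.items():
        table[code].append(item_id)
    return table


def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide code within Hamming distance `radius` of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit
            yield flipped


def search(table, query_code, n_bits, radius):
    """Collect the items stored in every bucket inside the Hamming ball."""
    hits = []
    for code in hamming_ball(query_code, n_bits, radius):
        hits.extend(table.get(code, ()))
    return hits
```

    The number of buckets probed grows combinatorially with the radius, so this lookup is only cheap for small radii.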
  • 25. Applications: Clustering • Correlation clustering! • Allows us to find groups without specifying their number (or shape) a priori [Figure: edges labelled + and -]
  • 26. Applications: Clustering • Correlation clustering! • Implementations • Pivot algorithm: 3-approximation • Parallel version: requires $\log^2(n)$ iterations [Figure: edges labelled + and -]
    Pivot algorithm, CC-Pivot(G):
      Pick random pivot i ∈ V
      Set C = {i}, V' = Ø
      For all j ∈ V, j ≠ i:
        If (i,j) ∈ E+ then add j to C
        Else (if (i,j) ∈ E−) add j to V'
      Let G' be the subgraph induced by V'
      Return clustering C, CC-Pivot(G')
    Parallel version: while the instance is non-empty
      1. Let A be its current maximum positive degree
      2. Activate each element independently with probability ε/A
      3. Deactivate all the active elements that are connected through a positive edge to other active elements
      4. The remaining active nodes become pivots
      5. Create one cluster for each pivot (breaking ties randomly)
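    A minimal sketch of the sequential pivot algorithm, written as a loop instead of a recursion. It assumes a complete signed graph in which every pair not listed as a positive edge is negative; the function name `cc_pivot` and its arguments are hypothetical.

```python
import random


def cc_pivot(nodes, positive_edges, seed=None):
    """Correlation clustering by random pivots.

    nodes: iterable of sortable, hashable ids.
    positive_edges: iterable of (u, v) pairs in E+; all other pairs
    are treated as negative (an assumption of this sketch).
    """
    rng = random.Random(seed)
    neighbours = {v: set() for v in nodes}
    for u, v in positive_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    remaining = set(neighbours)
    clusters = []
    while remaining:
        # Pick a random pivot among the nodes not yet clustered.
        pivot = rng.choice(sorted(remaining))
        # C = pivot plus its remaining positive neighbours.
        cluster = {pivot} | (neighbours[pivot] & remaining)
        clusters.append(cluster)
        # V' = everything else; continue on the induced subgraph.
        remaining -= cluster
    return clusters
```

    Note that the number of clusters is never passed in: it falls out of the sign pattern of the edges and the pivot choices.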
  • 27. Applications: Anomaly detection • Local outlier factor (LOF)! • An anomaly is a point whose local density is abnormally low compared with that of the points most similar to it (its nearest neighbours)
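    As a rough illustration of the idea, here is a brute-force LOF computation over the exact k-nn graph. It is a sketch under assumptions: Euclidean distance, no duplicate points (which would give zero reachability distances), and a hypothetical function name `lof_scores`. In the setting of this talk the neighbour lists would come from an approximate k-nn graph instead of the full distance matrix.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of each point; values well above 1 suggest
    a point that is much less dense than its neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    knn, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        knn.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th neighbour

    def local_reachability_density(i):
        # reach(i, j) = max(k-distance of j, d(i, j))
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF(i) = mean of lrd(j) / lrd(i) over the neighbours j of i
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]
```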
  • 28. Applications: Classification • Old intuitive kNN classifier • Semi-supervised learning
  • 29. Applications: Inference Idea: by minimising total variation with respect to the most connected neighbours, we can infer geolocation for Twitter users. See: Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization. IEEE 2014 Conference on Big Data. Excerpts from the paper shown on the slide:
    Figure 6: Histogram of tweets as a function of activity level. For each group of users described in fig. 4 and fig. 5 we collected the total number of tweets generated by the group. Despite the high number of inactive users, the bulk of tweets are generated by active Twitter users, indicating the importance of geotagging active accounts.
    Figure 7: Histogram of errors with different restrictions on the maximum allowable geographic dispersion of each user's […]
    Figure 2: Study of contact patterns between users who reveal their location via GPS. (a) CDF of the geographic distance between friends. (b) CDF of the geographic distance between a user and their geographically closest friend. Subgraphs of GPS users are taken from the bidirectional @mention network (blue), the bidirectional @mention network after filtering edges for triadic closures (green), and the complete unidirectional @mention network (black). In (a), we see that the distances spanned by reciprocated @mentions (blue and green) are smaller than those spanned by any @mention (black). In (b), we see that users often have at least one online social tie with a geographically nearby user. The subgraph sizes are: 19,515,278 edges and 3,972,321 nodes (green), 20,576,189 edges and 4,488,759 nodes (blue), 100,126,247 edges and 5,648,220 nodes (black). We suspect these results would be even stronger if more GPS data were available.
    […] well-aligned with geographic distance, we restrict our attention to GPS-known users and study contact patterns between them in fig. 2. Users with GPS-known locations make up only a tiny portion of our @mention networks. Despite the relatively small amount of data, we can still see in fig. 2 that online social ties typically form between users who live near each other and that a majority of GPS-known users have at least one GPS-known friend within 10km.
    The optimization (1) models proximity of connected users. Unfortunately, the total variation functional is nondifferentiable and finding a global minimum is thus a formidable challenge. We will employ "parallel coordinate descent" [25] to solve (1). Most variants of coordinate descent cycle through the domain sequentially, updating each variable and communicating back the result before the next variable can update. The scale of our data necessitates a parallel approach, prohibiting us from making all the communication steps required by a traditional coordinate descent method.
    At each iteration, our algorithm simultaneously updates each user's location with the l1-multivariate median of their friends' locations. Only after all updates are complete do we communicate our results over the network. At iteration $k$, denote the user estimates by $f^k$ and the variation on the $i$th node by
      $\nabla_i(f^k, f) = \sum_j w_{ij}\, d(f, f^k_j)$ (6)
    Parallel coordinate descent can now be stated concisely in alg. 1. The argument that minimizes (6) is the l1-multivariate median of the locations of the neighbours of node $i$. By placing this computation inside the parfor of alg. 1, we have reproduced the Spatial Label Propagation algorithm of [12] as a coordinate descent method designed to minimize total variation.
    Algorithm 1: Parallel coordinate descent for constrained TV minimization
      Initialize: $f_i = l_i$ for $i \in L$
      for $k = 1 \dots N$ do
        parfor $i$:
          if $i \in L$ then
            $f^{k+1}_i = l_i$
          else
            $f^{k+1}_i = \arg\min_f \nabla_i(f^k, f)$
          end
        end
        $f^k = f^{k+1}$
      end
    3.4 Individual Error Estimation
    The vast majority of Twitter users @mention with geographically close users. However, there do exist several users who have amassed friends dispersed around the globe. For these users, our approach should not be used to infer location. We use a robust estimate of the dispersion of each user's friend locations to infer accuracy of our geocoding algorithm. Our estimate for the error on user $i$ is the median absolute deviation of the inferred locations of user $i$'s friends, computed via (3). With a dispersion restriction as an additional parameter, […], our optimization becomes
      $\min_f \nabla f$ subject to $f_i = l_i$ for $i \in L$ and $\max_i \widetilde{\nabla f_i}$ […] (7)
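    The update rule of Algorithm 1 can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes planar Euclidean distance instead of geographic distance, approximates the l1-multivariate median by the best neighbour location (a weighted medoid), and runs the parfor as an ordinary loop that only reads the previous iterate. The names `weighted_medoid` and `tv_coordinate_descent` are hypothetical.

```python
import math


def weighted_medoid(located):
    """located: list of (location, weight). Returns the listed location f
    minimising sum_j w_j * d(f, f_j), i.e. (6) restricted to the
    neighbours' own locations (an approximation of the l1 median)."""
    best, best_cost = None, math.inf
    for f, _ in located:
        cost = sum(w * math.dist(f, g) for g, w in located)
        if cost < best_cost:
            best, best_cost = f, cost
    return best


def tv_coordinate_descent(edges, labels, n_iter):
    """edges: dict node -> list of (neighbour, weight).
    labels: dict node -> known location (the set L)."""
    estimate = dict(labels)              # f_i = l_i for i in L
    for _ in range(n_iter):
        updated = dict(labels)           # labelled nodes stay fixed
        for i in edges:                  # conceptually a parfor
            if i in labels:
                continue
            # Read only the previous iterate f^k, never `updated`.
            located = [(estimate[j], w) for j, w in edges[i] if j in estimate]
            if located:
                updated[i] = weighted_medoid(located)
        estimate = updated               # f^k = f^{k+1}
    return estimate
```

    Because every update reads only the previous iterate, the inner loop has no dependencies between nodes and could be distributed across workers, which is the point of the parallel formulation.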
  • 30. Applications (many more) Blossom algorithm Shortest paths Search Algorithms Bipartite Minimum Cut
  • 32. Building graphs to discover information Dr. David Martinez Rego Big Data Spain 2015