Data Mining Project
By Gabriele Angeletti
angeletti.gabriele@gmail.com
Image Similarity & Semi-Supervised Learning
Document Similarity:
Shingling + Locality Sensitive Hashing
Jaccard Similarity = #(D1 ⋂ D2) / #(D1 ⋃ D2), Jaccard Distance = 1 - Jaccard Similarity
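As a quick refresher, a minimal example of these two quantities on a pair of made-up shingle sets:

# Jaccard similarity/distance of two shingle sets (the example shingles are invented)
d1 = {"the cat", "cat sat", "sat on"}
d2 = {"the cat", "cat ran", "ran off"}

jaccard_similarity = len(d1 & d2) / len(d1 | d2)   # 1 / 5 = 0.2
jaccard_distance = 1 - jaccard_similarity          # 0.8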
What about Image Similarity?
● Very high dimensional spaces
● Single dimension carries very little information
● Many variants and transformations (translation, rotation, scaling, lighting, etc.)
● Computationally demanding
● Jaccard Distance performs poorly in this context
Locality Sensitive Hashing
Main idea: hash points into buckets, such that the nearer two points are, the more
likely they are to be hashed into the same bucket.
Need a definition of “near”, and a function that maps “nearby” points to the same
buckets.
LSH family H: a set of hash functions that is (d1, d2, p1, p2)-sensitive. For points p, q:
If D(p,q) < d1 then P(h(p) == h(q)) > p1
If D(p,q) > d2 then P(h(p) == h(q)) < p2
Here h is a hash function drawn from H, and D is the distance function that defines “near” (and “far”).
LSH problems: many hyper-parameters to fine-tune, computationally expensive
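A minimal sketch of one such family for cosine distance (random hyperplanes, also known as signed random projections); the dimensions and number of hash functions are arbitrary:

import numpy as np

# Each hash function h_i(p) = sign(<r_i, p>) for a random direction r_i; points at a
# small angle agree on more signs, so they are more likely to share a bucket.
rng = np.random.default_rng(0)

def make_bucket_label(dim, k):
    """Return g(p) = (h1(p), ..., hk(p)), a k-digit bucket label."""
    R = rng.standard_normal((k, dim))
    return lambda p: tuple((R @ p > 0).astype(int))

g = make_bucket_label(dim=784, k=16)
points = rng.standard_normal((1000, 784))
buckets = {}
for i, p in enumerate(points):
    buckets.setdefault(g(p), []).append(i)   # candidate neighbors end up in the same bucket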
LSH Forest
An LSH index places each point p into a bucket with label g(p) = (h1(p), h2(p), . . . , hk(p)); g(p) is the k-digit
label assigned to point p.
LSH Forest, instead of assigning fixed-length labels to points, lets the labels be of variable length; each label
is made long enough to ensure that every point has a distinct label. A maximum label length km is imposed.
Variable-length label generation: let h1, h2, . . . , hkm be a sequence of km hash functions drawn
independently and uniformly at random from H. The length-x label of a point p is given by
g(p, x) = (h1(p), h2(p), . . . , hx(p)).
Ref: M. Bawa, T. Condie and P. Ganesan, “LSH Forest: Self-Tuning Indexes for Similarity Search”, in Proceedings of the 14th
International Conference on World Wide Web (WWW ’05), pp. 651-660, 2005.
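A sketch of the variable-length labels, reusing the random-hyperplane family above (km and the query point are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
km = 32
R = rng.standard_normal((km, 784))               # h1, ..., h_km, one random hyperplane each

def label(p, x):
    """g(p, x) = (h1(p), ..., hx(p)): the length-x prefix of p's full label."""
    return tuple((R[:x] @ p > 0).astype(int))

p = rng.standard_normal(784)
short, full = label(p, 4), label(p, km)          # short prefixes collide often, full labels rarely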
LSH Forest (cont.)
LSH Tree: logical prefix tree on the set of all labels, with each leaf corresponding to a point.
LSH Forest: composed of L such LSH Trees, each constructed with an independently drawn random
sequence of hash functions from H.
Query processing (m nearest neighbors of point p):
The LSH Trees are traversed in two phases.
In the first, top-down phase, we find in each tree i the leaf having the largest
prefix match xi with p’s label; x := maxi{xi} is the bottom-most
level of such leaf nodes across all L trees.
In the second, bottom-up phase, we collect M points from
the LSH Forest, moving up from level x towards the root.
The M points are then ranked in order of decreasing
similarity with p and returned.
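A brute-force sketch of this two-phase query (a real LSH Forest stores the labels in prefix trees instead of scanning them; L, km and the data here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
L, km, dim, n = 4, 32, 784, 1000
hyperplanes = [rng.standard_normal((km, dim)) for _ in range(L)]    # one hash sequence per tree
points = rng.standard_normal((n, dim))
labels = [(R @ points.T > 0).T.astype(int) for R in hyperplanes]    # (n, km) labels per tree

def prefix_len(a, b):
    """Length of the common prefix of two labels."""
    neq = np.nonzero(a != b)[0]
    return neq[0] if neq.size else len(a)

def query(p, m=10, M=40):
    q_labels = [(R @ p > 0).astype(int) for R in hyperplanes]
    # Phase 1 (top-down): deepest prefix match of p's label in each tree, x = max over trees
    depth = [np.array([prefix_len(q, lab) for lab in labs]) for q, labs in zip(q_labels, labels)]
    x = max(d.max() for d in depth)
    # Phase 2 (bottom-up): collect points matching ever-shorter prefixes until M are found
    candidates = set()
    while len(candidates) < M and x >= 0:
        for d in depth:
            candidates.update(np.nonzero(d >= x)[0])
        x -= 1
    # Rank the candidates by decreasing cosine similarity with p, return the top m
    cand = np.array(sorted(candidates))
    sims = points[cand] @ p / (np.linalg.norm(points[cand], axis=1) * np.linalg.norm(p))
    return cand[np.argsort(-sims)][:m]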
LSH Forest - test setting
Distance function: cosine distance
LSHForest implementation: sklearn
Hyper-parameters: default values of sklearn
Accuracy metric for the k-th nearest neighbor: p / len(test_set), where
p = 0
for element in test_set:
    if label(element) == label(k-th nearest neighbor of element):
        p += 1
This accuracy metric measures the ability of LSH Forest to retrieve elements of the same class.
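A sketch of this evaluation, assuming an older scikit-learn release that still ships sklearn.neighbors.LSHForest (it was removed in 0.21); the data arrays are assumed to be already loaded and their names are illustrative:

import numpy as np
from sklearn.neighbors import LSHForest          # available in older scikit-learn (< 0.21)

# X_train, y_train, X_test, y_test: flattened images and their class labels (assumed loaded)
lshf = LSHForest(random_state=42)                # default hyper-parameters, cosine-based index
lshf.fit(X_train)

K = 50
_, indices = lshf.kneighbors(X_test, n_neighbors=K)   # (n_test, K) neighbor indexes into X_train

# accuracy@k: fraction of test points whose k-th neighbor has the same class
accuracy_at_k = [(y_train[indices[:, k]] == y_test).mean() for k in range(K)]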
LSH Forest - MNIST results
Dataset: MNIST - handwritten digits grayscale image dataset. 28x28 pixels per image
Training set - 50k images
Test set - 10k images
Dataset random samples:
LSH Forest accuracy@k for k = 1, … , 50 neighbors
Accuracy of ~95% for the first neighbor, drops to
~57% for the 50th neighbor.
LSH Forest - notMNIST results
Dataset: notMNIST - grayscale images of letters from A to J in different fonts. 28x28 pixels per image
Training set - 200k images
Test set - 10k images
Dataset random samples:
LSH Forest accuracy@k for k = 1, … , 50 neighbors
Accuracy of ~92% for the first neighbor, drops to
~70% for the 50th neighbor.
Can we do
better?
Idea: extract higher-level
features with semantic meaning.
How: Unsupervised Neural
Networks (Autoencoders)
● LSH Forest took 784-dimensional
feature vectors as input, treating
each pixel as a feature
● Extracting higher-level features
would:
- compress the data
- be computationally efficient
(features << 784)
- improve performance
(comparisons done between
features with semantic meaning
rather than raw pixels)
Autoencoders
Idea: Map input space to another space, and then
reconstruct the original input from that space.
The hope is that if the autoencoder is able to
reconstruct the original input from a lower-dimensional
representation of it, then that representation
has successfully captured important features of the
input distribution.
The mappings are linear in the model parameters (W, b),
followed by a non-linearity, e.g. tanh(Wx + b).
The loss function is usually the l2 loss (mean squared error)
or the cross-entropy loss.
Autoencoders - math
Input x with dimension (N x P)
Model parameters:
- matrix W of dimension (P x H)
- vector bh of dimension H
- vector bv of dimension P
h = sigma(xW + bh) // latent representation (encoder)
z = sigma(hW’ + bv) // reconstruction (decoder), with tied weights W’ = Wᵀ
loss = loss_function(x, z)
The parameter update rule is derived using backpropagation (i.e. computing
the gradients of the chosen loss function with respect to the parameters).
theta’ = theta - learning_rate * gradient_wrt_theta
sigma() is a non-linear activation function (common choices are sigmoid and tanh)
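A minimal NumPy sketch of such an autoencoder with tied weights, sigmoid activations and MSE loss (batch size, layer sizes and learning rate are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

N, P, H = 128, 784, 64                   # batch size, input dim, hidden dim (illustrative)
x = rng.random((N, P))                   # stand-in for a batch of flattened images in [0, 1]

W = rng.standard_normal((P, H)) * 0.01   # encoder weights; decoder uses W.T (tied weights)
bh, bv = np.zeros(H), np.zeros(P)
lr = 0.1

for step in range(100):
    h = sigmoid(x @ W + bh)              # latent representation (encoder)
    z = sigmoid(h @ W.T + bv)            # reconstruction (decoder)
    loss = ((z - x) ** 2).mean()         # MSE loss

    # Backpropagation: gradients of the loss with respect to W, bh, bv
    dz = 2 * (z - x) / (N * P) * z * (1 - z)    # dL/d(pre-activation of z), shape (N, P)
    dh = dz @ W * h * (1 - h)                   # dL/d(pre-activation of h), shape (N, H)
    dW = x.T @ dh + dz.T @ h                    # encoder + decoder contributions, shape (P, H)
    dbh, dbv = dh.sum(axis=0), dz.sum(axis=0)

    # theta' = theta - learning_rate * gradient
    W, bh, bv = W - lr * dW, bh - lr * dbh, bv - lr * dbv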
Denoising Autoencoders
Autoencoder variant trained to reconstruct the original input starting from a
corrupted version of it (denoising task).
Idea: if the autoencoder is able to reconstruct the input x
from a corrupted version x~ of it, we can expect that it must have
captured robust features in the latent representation.
Corruption method used - masking noise: a random fraction v of the pixels
is set to 0. Other possible methods: salt-and-pepper noise (a fraction v of the pixels
is flipped to the minimum or maximum value), Gaussian noise.
x~ = noise(x)
h = sigma(x~W + bh) // latent representation (encoder)
z = sigma(hW’ + bv) // reconstruction (decoder)
loss = loss_function(x, z)
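A sketch of the masking-noise corruption (the corruption fraction v is illustrative); the training loop above is unchanged except that the encoder reads x~ while the loss still compares z to the clean x:

import numpy as np

rng = np.random.default_rng(0)

def masking_noise(x, v=0.3):
    """Set a random fraction v of the input entries to 0 (masking noise)."""
    mask = rng.random(x.shape) >= v      # each entry is kept with probability 1 - v
    return x * mask

# x_tilde = masking_noise(x); h = sigmoid(x_tilde @ W + bh); the loss still uses the clean x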
Stacked (Denoising) Autoencoders
We can take the latent representation learned by an autoencoder, and use it as
training data for a second autoencoder. In this way, we can create
a stack of autoencoders, where each layer learns a higher-level
representation of the input distribution with respect to the layer
below it. It can be theoretically proved that adding a layer to the stack
will improve the variational bound on the log probability of the data,
if the layers are trained properly.
This procedure is called unsupervised pre-training.
Once the autoencoders are trained, we can construct the final architecture
(deep autoencoder) and perform unsupervised fine-tuning of all the
layers together.
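A sketch of the greedy layer-wise procedure, assuming a hypothetical train_autoencoder(data, hidden_dim) helper that wraps a training loop like the one above and returns the learned parameters together with the encodings (the layer sizes are the ones used later in this deck):

# Greedy unsupervised pre-training: each autoencoder is trained on the latent
# representation produced by the one below it.
# train_autoencoder(data, hidden_dim) is a hypothetical helper returning (W, bh, encodings).
layer_sizes = [2048, 1024, 256, 64]

data = X_train            # flattened training images, illustrative name
stack = []
for hidden_dim in layer_sizes:
    W, bh, encodings = train_autoencoder(data, hidden_dim)
    stack.append((W, bh))
    data = encodings      # feed the latent codes to the next autoencoder in the stack

# The stacked encoders (plus mirrored decoders) are then assembled into a deep
# autoencoder and fine-tuned together on the reconstruction loss.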
Deep Autoencoder - visualize high-level features
Visualization: t-distributed stochastic neighbor embedding (t-SNE)
Left: t-SNE of the original MNIST (784 to 2), Right: t-SNE of the latent representation of a trained deep autoencoder (64 to 2)
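A sketch of such a visualization with scikit-learn's TSNE (the data arrays are assumed to be already computed and their names are illustrative):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X_test: raw 784-d pixels, encodings_test: 64-d deep-autoencoder codes, y_test: class labels
xy_raw = TSNE(n_components=2).fit_transform(X_test)            # 784 -> 2
xy_enc = TSNE(n_components=2).fit_transform(encodings_test)    # 64 -> 2

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].scatter(xy_raw[:, 0], xy_raw[:, 1], c=y_test, s=2)     # left: raw pixels
axes[1].scatter(xy_enc[:, 0], xy_enc[:, 1], c=y_test, s=2)     # right: learned encodings
plt.show()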
Combine Deep
Autoencoders
with LSH Forest
● Deep Autoencoders extract high-level
meaningful features (encodings)
● Encodings dimension is typically much
lower than original dimension (data
compression + computational efficiency)
● Similarity can be computed at the
encodings level rather than the pixel level
(i.e. similarity between two high-level
features (e.g. eyes) is much more
meaningful than similarity between pairs
of pixels)
● This more powerful similarity detection
algorithm can be used in semi-supervised
settings (more on that later)
Deep Autoencoder + LSH Forest - MNIST
- Deep Autoencoder architecture:
4 encoders - 784 → 2048 → 1024 → 256 → 64
4 decoders - 64 → 256 → 1024 → 2048 → 784
- LSH Forest run on the 64-d encodings
- Sample reconstructions of the model:
LSH Forest accuracy@k for k = 1, … , 50 neighbors
Similar accuracy for the first neighbors; the deep
autoencoder improves accuracy by ~35% for the 50th neighbor
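A sketch of the combined pipeline, assuming a hypothetical encode(X) helper that maps images through the trained 784 → 2048 → 1024 → 256 → 64 encoder stack (and the older scikit-learn LSHForest as before):

from sklearn.neighbors import LSHForest    # older scikit-learn (< 0.21)

# encode(X): hypothetical helper applying the trained deep encoder to flattened images
enc_train = encode(X_train)                # (n_train, 64)
enc_test = encode(X_test)                  # (n_test, 64)

lshf = LSHForest(random_state=42)
lshf.fit(enc_train)                        # index the 64-d encodings instead of raw pixels
_, indices = lshf.kneighbors(enc_test, n_neighbors=50)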
Deep Autoencoder + LSH Forest - notMNIST
- Deep Autoencoder architecture:
4 encoders - 784 → 2048 → 1024 → 256 → 64
4 decoders - 64 → 256 → 1024 → 2048 → 784
- LSH Forest run on the 64-d encodings
- Sample reconstructions of the model:
LSH Forest accuracy@k for k = 1, … , 50 neighbors
Similar accuracy for the first neighbors; the deep
autoencoder improves accuracy by ~16% for the 50th neighbor
Deep Autoencoder + LSH Forest - Query results
MNIST: query image: a “2”, neighbor indexes: 1, 6, 12, 20, 27, 36, 50
Basic LSH (n. of 2s: 17/50) Deep Autoencoder + LSH (n. of 2s: 50/50)
notMNIST: query image: a “G”, neighbor indexes: 1, 6, 12, 20, 27, 36, 50
Basic LSH (n. of Gs: 7/50) Deep Autoencoder + LSH (n. of Gs: 37/50)
Can we exploit
better similarity
techniques in
semi-supervised
settings?
● Big Data (unlabeled data >>>> labeled data)
● Supervised learning algorithms need labels
● Assumption: dataset with k% unlabeled data
(say 98%) and j% labeled data (say 2%)
● Can we infer the labels of the 98% of the
data using the labels we have for the 2%?
● Deep Autoencoder + LSH is totally
unsupervised: we can use 100% of the data
● Idea: estimate the label of an item with the
label of the most similar item for which we
have the label
Two approaches
First Found: Assign the label of the
most similar item for which the label
is known.
Majority Voting: Assign the most
frequent label among the neighbors.
The results are nearly identical for
the two approaches.
Result: By knowing only 2750 / 55000 (5%) of
the labels in the training set, we can infer the
labels for the test set with a ~95% accuracy.
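A sketch of the two label-estimation strategies on top of the neighbor indexes returned above; y_known is a partial label array with -1 for unknown labels (names and the -1 encoding are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# indices: (n_test, K) neighbor indexes into the training set (from the LSHForest queries above)
# y_known: training labels with -1 where the label is not known

def first_found(neighbor_idx, y_known):
    """Assign the label of the most similar neighbor whose label is known."""
    for j in neighbor_idx:                        # neighbors are ordered by similarity
        if y_known[j] != -1:
            return y_known[j]
    return rng.choice(y_known[y_known != -1])     # fallback: no labeled neighbor found

def majority_voting(neighbor_idx, y_known):
    """Assign the most frequent label among the labeled neighbors."""
    labels = y_known[neighbor_idx]
    labels = labels[labels != -1]
    if labels.size == 0:
        return rng.choice(y_known[y_known != -1])
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

y_pred = np.array([majority_voting(row, y_known) for row in indices])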
Two more metrics
The average position of the first neighbor
with a known label decreases as the number
of known labels increases. Thus, the more
labels we know, the fewer neighbors we have
to compute with LSH.
The average number of elements for which no
labeled neighbor is found also decreases as the
number of known labels increases. For these
elements, a random label is chosen. Even so, with
just 5% of the labels known, the average number
of such “not found” elements is only 0.2.
Future work:
● Try majority voting weighted on distances
● Try convolutional autoencoders instead of denoising
autoencoders (large expected improvement, especially
with images, thanks to translation invariance)
● Try with more complex datasets (e.g. CIFAR-10, 32x32
color images)
● Try the approach with other domains (e.g. sound, text, ...)
● Object similarity across domains (e.g. an image of a “2”
similar to the sound of a person saying “two”)