Probabilistic data structures. Part 4. Similarity

Andrii Gakhov, Ph.D., Senior Software Engineer at Ferret-Go
tech talk @ ferret

PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 4: SIMILARITY
SIMILARITY
Agenda:
▸ Locality-Sensitive Hashing (LSH)
▸ MinHash
▸ SimHash
THE PROBLEM
• To find clusters of similar documents in a document set
• To find duplicates of a document in a document set
LOCALITY SENSITIVE HASHING
LOCALITY-SENSITIVE HASHING
• Locality-Sensitive Hashing (LSH) was introduced by Indyk and
Motwani in 1998 as a family of functions with the following property:
• similar input objects (from the domain of such functions) have a
higher probability of colliding in the range space than non-similar
ones
• LSH differs from conventional and cryptographic hash functions because
it aims to maximize the probability of a “collision” for similar
items.
• Intuitively, LSH is based on the simple idea that, if two points are close
together, then after a “projection” operation these two points will remain
close together.
LSH: PROPERTIES
• One general approach to LSH is to “hash” items several
times, in such a way that similar items are more likely to
be hashed to the same bucket than dissimilar items are.
We then consider any pair that hashed to the same
bucket for any of the hashings to be a candidate pair.
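As an illustration (not from the slides), a minimal Python sketch of this candidate-pair generation could look as follows; the lsh_functions family is a hypothetical stand-in for any concrete LSH family:

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(items, lsh_functions):
        """Collect pairs of item ids that collide under at least one LSH function.
        items: dict mapping item id -> item; lsh_functions: list of callables."""
        candidates = set()
        for h in lsh_functions:
            buckets = defaultdict(list)
            for item_id, item in items.items():
                buckets[h(item)].append(item_id)   # similar items tend to share a bucket
            for bucket in buckets.values():
                candidates.update(combinations(sorted(bucket), 2))
        return candidates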
LSH: PROPERTIES
• Examples of LSH functions are known for resemblance,
cosine, and Hamming distances.
• LSH functions do not exist for certain similarity measures commonly
used in information retrieval, such as the Dice coefficient and the
Overlap coefficient (Moses S. Charikar, 2002).
LSH: PROBLEM
• Given a query point, we wish to find the points in a large
database that are closest to the query. We wish to guarantee
with a high probability equal to 1 − δ that we return the
nearest neighbor for any query point.
• Conceptually, this problem is easily solved by iterating through each point in the
database and calculating the distance to the query object. However, our database
may contain billions of objects — each object described by a vector that contains
hundreds of dimensions.
• LSH proves useful to identify nearest neighbors quickly even
when the database is very large.
LSH: FINDING DUPLICATE PAGES ON THE WEB
How to select features for the webpage?
• document vector from page content
• Compute “document-vector” by case-folding, stop-word removal, stemming, computing
term-frequencies and finally, weighting each term by its inverse document frequency (IDF)
• shingles from page content
• Consider a document as a sequence of words. A shingle is a hash value of a k-gram,
which is a sub-sequence of k consecutive words.
• Hashes for shingles can be computed using Rabin’s fingerprints
• connectivity information
• The idea that similar web pages have several incoming links in common.
• anchor text and anchor window
• The idea that similar documents should have similar anchor text. The words in the anchor
text or anchor window are folded into the document-vector itself.
• phrases
• The idea is to identify phrases and compute a document-vector that includes phrases as terms.
LSH: FINDING DUPLICATE PAGES ON THE WEB
• The Web contains many duplicate pages, partly because content is duplicated
across sites and partly because there is more than one URL that points to the
same page.
• If we use shingles as features, each shingle represents a portion of a Web page
and is computed by forming a histogram of the words found within that portion
of the page.
• We can test to see if a portion of the page is duplicated elsewhere on the Web
by looking for other shingles with the same histogram. If the shingles of the
new page match shingles from the database, then it is likely that the new page
bears a strong resemblance to an existing page.
• Because Web pages are surrounded by navigational and other information that
changes from site to site, it is useful to use a nearest-neighbor solution that can
employ LSH functions.
LSH: RETRIEVING IMAGES
• LSH can be used in image retrieval as an object recognition
tool. We compute a detailed metric for many different
orientations and configurations of an object we wish to
recognize.
• Then, given a new image we simply check our database to see
if a precomputed object’s metrics are close to our query. This
database contains millions of poses and LSH allows us to
quickly check if the query object is known.
LSH: RETRIEVING MUSIC
• In music retrieval conventional hashes and robust features are typically used to find
musical matches.
• The features can be fingerprints, i.e., representations of the audio signal that are robust to
common types of abuse that are performed to audio before it reaches our ears.
• Fingerprints can be computed, for instance, by noting the peaks in the spectrum (because they are
robust to noise) and encoding their position in time and space. The user then just has to query the
database for the same fingerprint.
• However, to find similar songs we cannot use fingerprints, because fingerprints differ
when a song is remixed for a new audience or when a different artist performs the
same song. Instead, we can use several seconds of the song (a snippet) as a shingle.
• To determine if two songs are similar, we need to query the database and see if a large enough number of
the query shingles are close to one song in the database.
• Although closeness depends on the feature vector, we know that long shingles provide
specificity. This is particularly important because we can eliminate duplicates to improve
search results and to link recommendation data between similar songs.
LSH: PYTHON
• https://github.com/ekzhu/datasketch
  datasketch gives you probabilistic data structures that
  can process very large amounts of data (see the usage sketch below)
• https://github.com/simonemainardi/LSHash
  LSHash is a fast Python implementation of locality-
  sensitive hashing with persistence support
• https://github.com/go2starr/lshhdc
  LSHHDC: Locality-Sensitive Hashing based High
  Dimensional Clustering
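A hedged usage sketch with datasketch (it relies on the MinHash structure covered later in this deck; the API follows the project's README, so verify it against the version you install):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(words, num_perm=128):
        """Build a MinHash sketch from an iterable of tokens."""
        m = MinHash(num_perm=num_perm)
        for w in words:
            m.update(w.encode("utf8"))
        return m

    # Index two documents; a query returns keys of stored sets whose
    # estimated Jaccard similarity is >= the threshold.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("doc1", minhash_of(["python", "is", "a", "programming", "language"]))
    lsh.insert("doc2", minhash_of(["java", "is", "a", "programming", "language"]))
    print(lsh.query(minhash_of(["python", "is", "a", "snake"])))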
LSH: READ MORE
• http://www.cs.princeton.edu/courses/archive/spr04/cos598B/
bib/IndykM-curse.pdf
• http://infolab.stanford.edu/~ullman/mmds/book.pdf
• http://www.cs.princeton.edu/courses/archive/spr04/cos598B/
bib/CharikarEstim.pdf
• http://www.matlabi.ir/wp-content/uploads/bank_papers/
g_paper/g15_Matlabi.ir_Locality-
Sensitive%20Hashing%20for%20Finding%20Nearest%20Nei
ghbors.pdf
• https://en.wikipedia.org/wiki/Locality-sensitive_hashing
MINHASH
MINHASH: RESEMBLANCE SIMILARITY
• When talking about the relation of two documents A and B, we are mostly interested in the
concept of being "roughly the same", which can be mathematically formalized as their
resemblance (or Jaccard similarity).
• Every document can be seen as a set of features, so the problem can be reduced to a set
intersection problem (finding the similarity between sets).
• The resemblance r(A,B) of two documents is a number between 0 and 1, such that it is
close to 1 for documents that are "roughly the same":

  J(A,B) = r(A,B) = |A ∩ B| / |A ∪ B|

• For the duplicate detection problem, we can reformulate it as the task of finding pairs of
documents whose resemblance r(A,B) exceeds a certain threshold R.
• For the clustering problem, we can use d(A,B) = 1 - r(A,B) as a metric to design
algorithms intended to cluster a collection of documents into sets of closely resembling
documents.
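For small documents the exact resemblance can be computed directly; a minimal Python sketch:

    def jaccard(a: set, b: set) -> float:
        """Exact resemblance r(A,B) = |A intersect B| / |A union B|."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    # Documents as sets of features (bags of words collapsed to sets)
    A = {"python", "is", "a", "programming", "language"}
    B = {"java", "is", "a", "programming", "language"}
    print(jaccard(A, B))  # 4/6 ≈ 0.67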
MINHASH
• MinHash (minwise hashing algorithm) has been proposed by Andrei
Broder in 1997
• The minwise hashing family applies a random permutation π: Ω → Ω
to the given set S, and stores only the minimum value after the
permutation mapping.
• Formally, MinHash is defined as:

  h_π^min(S) = min(π(S)); with k permutations, the signature is (min(π1(S)), …, min(πk(S)))
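A literal Python sketch of this definition (impractical for a large Ω, since it materializes the permutations; the universe and seed are illustrative assumptions):

    import random

    def minhash_signature(s, universe, k, seed=42):
        """k random permutations of the universe; for each, keep min(π(S))."""
        rng = random.Random(seed)
        universe = list(universe)
        signature = []
        for _ in range(k):
            perm = universe[:]
            rng.shuffle(perm)                     # a random permutation π: Ω → Ω
            rank = {element: i for i, element in enumerate(perm)}
            signature.append(min(rank[x] for x in s))
        return signature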
MINHASH: MATRIX REPRESENTATION
• A collection of sets can be visualized as a characteristic matrix CM.
• The columns of the matrix correspond to the sets, and the rows
correspond to elements of the universal set from which elements of the
sets are drawn.
• CM(r, c) = 1 if the element for row r is a member of the set for
column c, otherwise CM(r, c) = 0.
• NOTE: The characteristic matrix is unlikely to be the way the data is
stored, but it is useful as a way to visualize the data.
• One reason not to store the data as a matrix is that, in practice, these matrices
are almost always sparse (they have many more 0's than 1's).
MINHASH: MATRIX REPRESENTATION
• Consider the following documents (represented as bags of words):
• S1 = {python, is, a, programming, language}
• S2 = {java, is, a, programming, language}
• S3 = {a, programming, language}
• S4 = {python, is, a, snake}
• We can assume that our universal set is limited to the words from these 4 documents.
The characteristic matrix of these sets is:

  element      row  S1  S2  S3  S4
  a             0    1   1   1   1
  is            1    1   1   0   1
  java          2    0   1   0   0
  language      3    1   1   1   0
  programming   4    1   1   1   0
  python        5    1   0   0   1
  snake         6    0   0   0   1
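A short sketch that reproduces this characteristic matrix from the four sets:

    sets = {
        "S1": {"python", "is", "a", "programming", "language"},
        "S2": {"java", "is", "a", "programming", "language"},
        "S3": {"a", "programming", "language"},
        "S4": {"python", "is", "a", "snake"},
    }
    universe = sorted(set().union(*sets.values()))  # rows of the matrix
    CM = [[1 if word in s else 0 for s in sets.values()] for word in universe]
    for word, row in zip(universe, CM):
        print(f"{word:12s} {row}")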
MINHASH: INTUITION
• The minhash value of any column of the characteristic
matrix is the number of the first row, in the
permuted order, in which the column has a 1.
• To minhash a set represented by such columns, we need
to pick a permutation of the rows.
MINHASH: ONE STEP EXAMPLE
• Consider the characteristic matrix together with a permutation of its rows:

  row  S1  S2  S3  S4  permuted row
   1    1   0   1   0       2
   2    1   0   0   0       5
   3    0   0   1   1       6
   4    1   0   0   1       1
   5    0   0   1   1       4
   6    0   1   1   0       3

• Record, for each column, the permuted number of its first “1”:
  m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6
• Estimate the Jaccard similarity: J(Si,Sj) = 1 if m(Si) = m(Sj), or 0 otherwise:
  J(S1,S3) = 1, J(S1,S4) = 0
MINHASH: INTUITION
• It is not feasible to permute a large characteristic
matrix explicitly.
• Even picking a random permutation of millions of rows is time-
consuming, and the necessary sorting of the rows would take even
more time.
• Fortunately, it is possible to simulate the effect of a
random permutation by a random hash function that
maps row numbers to as many buckets as there are rows.
• So, instead of picking n random permutations of rows, we pick n
randomly chosen hash functions h1, h2, . . . , hn on the rows
MINHASH: PROPERTIES
• Connection between minhash and resemblance (Jaccard)
similarity of the sets that are minhashed:
• The probability that the minhash function for a random permutation
of rows produces the same value for two sets equals the Jaccard
similarity of those sets
• The minhash(π) of a set is the number of the row (element) with the first
non-zero value in the permuted order π.
• MinHash is an LSH for resemblance (Jaccard) similarity, which is defined
over binary vectors.
MINHASH: ALGORITHM
• Pick k randomly chosen hash functions h1, h2, . . . , hk.
• From the column representing set S, construct the minhash signature for S: the
vector [h1(S), h2(S), . . . , hk(S)].
• Build the characteristic matrix CM for the collection of sets.
• Construct a signature matrix SIG, where SIG(i, c) corresponds to the i-th hash
function and column c. Initially, set SIG(i, c) = ∞ for each i and c.
• For each row r:
• Compute h1(r), h2(r), . . . , hk(r)
• For each column c:
• If CM(r,c) = 1, then set SIG(i, c) = min{SIG(i, c), hi(r)} for each i = 1 . . . k
• Estimate the resemblance (Jaccard) similarities of the underlying sets from the
final signature matrix (a sketch follows below).
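A minimal Python sketch of this row-by-row computation (the two hash functions are the ones used in the example that follows):

    import math

    def signature_matrix(CM, hash_funcs):
        """CM: characteristic matrix as a list of 0/1 rows; returns SIG."""
        n_cols = len(CM[0])
        SIG = [[math.inf] * n_cols for _ in hash_funcs]   # SIG(i, c) = ∞
        for r, row in enumerate(CM):
            hashes = [h(r) for h in hash_funcs]           # h1(r), ..., hk(r)
            for c, bit in enumerate(row):
                if bit == 1:
                    for i, hr in enumerate(hashes):
                        SIG[i][c] = min(SIG[i][c], hr)
        return SIG

    h1 = lambda x: (x + 1) % 7
    h2 = lambda x: (3 * x + 1) % 7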
MINHASH: EXAMPLE
• Consider 2 hash functions:
• h1(x) = (x + 1) mod 7
• h2(x) = (3x + 1) mod 7
• Consider the same set of 4 documents and a universal set that consists of 7 words:

  element      row  S1  S2  S3  S4  h1(row)  h2(row)
  a             0    1   1   1   1     1        1
  is            1    1   1   0   1     2        4
  java          2    0   1   0   0     3        0
  language      3    1   1   1   0     4        3
  programming   4    1   1   1   0     5        6
  python        5    1   0   0   1     6        2
  snake         6    0   0   0   1     0        5
MINHASH: EXAMPLE
• For reference, the characteristic matrix with hash values:

  r  S1  S2  S3  S4  h1  h2
  0   1   1   1   1   1   1
  1   1   1   0   1   2   4
  2   0   1   0   0   3   0
  3   1   1   1   0   4   3
  4   1   1   1   0   5   6
  5   1   0   0   1   6   2
  6   0   0   0   1   0   5

• Initially the signature matrix has all ∞:

       S1  S2  S3  S4
  h1    ∞   ∞   ∞   ∞
  h2    ∞   ∞   ∞   ∞

• First, consider row 0: h1(0) and h2(0) are both 1. So, for all sets with a 1 in
row 0, we set the value h1(0) = h2(0) = 1 in the signature matrix:

       S1  S2  S3  S4
  h1    1   1   1   1
  h2    1   1   1   1

• Next, consider row 1: h1(1) = 2 and h2(1) = 4. In this row only S1, S2
and S4 have 1’s, so we set the appropriate cells to the minimum between the
existing value and 2 for h1 or 4 for h2 (nothing changes, since 1 is smaller):

       S1  S2  S3  S4
  h1    1   1   1   1
  h2    1   1   1   1
MINHASH: EXAMPLE
• Continue with the other rows; after row 5 the signature matrix is:

       S1  S2  S3  S4
  h1    1   1   1   1
  h2    1   0   1   1

• Finally, consider row 6: h1(6) = 0 and h2(6) = 5. In this row only S4
has a 1, so we set the appropriate cells to the minimum between the existing
value and 0 for h1 or 5 for h2. The final signature matrix is:

       S1  S2  S3  S4
  h1    1   1   1   0
  h2    1   0   1   1

• Document S1 is similar to S3 (identical signature columns), so the estimated similarity is 1
(the true value is J = 0.6)
• Documents S2 and S4 agree on no signature rows, so the estimated similarity is 0
(the true value is J ≈ 0.28)
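The similarity estimate from the signature matrix is simply the fraction of hash functions on which two columns agree; with only k = 2 functions the estimate is very coarse, in practice k is in the hundreds. A sketch:

    def estimated_similarity(SIG, c1, c2):
        """Fraction of signature rows on which columns c1 and c2 agree."""
        agree = sum(1 for row in SIG if row[c1] == row[c2])
        return agree / len(SIG)

    SIG = [[1, 1, 1, 0],   # the final signature matrix from the example
           [1, 0, 1, 1]]
    print(estimated_similarity(SIG, 0, 2))  # S1 vs S3 -> 1.0
    print(estimated_similarity(SIG, 1, 3))  # S2 vs S4 -> 0.0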
MINHASH: PYTHON
• https://github.com/ekzhu/datasketch
  datasketch gives you probabilistic data structures that
  can process very large amounts of data
• https://github.com/anthonygarvan/MinHash
  MinHash is an effective pure-Python implementation of
  minhash
MINHASH: READ MORE
• http://infolab.stanford.edu/~ullman/mmds/book.pdf
• http://www.cs.cmu.edu/~guyb/realworld/slidesS13/
minhash.pdf
• https://en.wikipedia.org/wiki/MinHash
• http://www2007.org/papers/paper570.pdf
• https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-
Jaccard+Shingle.pdf
B-BIT MINHASH
B-BIT MINHASH
• A modification of minwise hashing, called b-bit minwise
hashing, was proposed by Ping Li and Arnd Christian König
in 2010.
• The method of b-bit minwise hashing provides a simple
solution by storing only the lowest b bits of each hashed
value, reducing the dimensionality of the (expanded)
hashed data matrix to just 2^b · k.
• b-bit minwise hashing is simple and requires only minimal
modification to the original minwise hashing algorithm.
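A minimal sketch of the reduction step, assuming a full minhash signature has already been computed (the example values are hypothetical):

    def b_bit_signature(signature, b=1):
        """Keep only the lowest b bits of each minhash value."""
        mask = (1 << b) - 1
        return [v & mask for v in signature]

    sig = [12, 5, 7, 20]           # hypothetical minhash values (shortened)
    print(b_bit_signature(sig))    # [0, 1, 1, 0]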
B-BIT MINHASH
• Intuitively, using fewer bits per sample will increase the estimation
variance, compared to the original MinHash, at the same "sample size" k.
Thus, we have to increase k to maintain the same accuracy.
• Interestingly, the theoretical results demonstrate that, when the resemblance is not
too small (e.g. R ≥ 0.5, a common threshold), we do not have to increase
k much.
• This means that, compared to the original approach, b-bit minwise hashing can improve
estimation accuracy and significantly reduce the storage requirement at the same time.
• For example, for b = 1 and R = 0.5, the estimation variance will increase by at
most a factor of 3.
• So, in order not to lose accuracy, we have to increase the sample size k by a factor of 3.
• If we originally stored each hashed value using 64 bits, the overall space improvement
from using b = 1 is a factor of 64/3 ≈ 21.3.
B-BIT MINHASH: READ MORE
• https://www.microsoft.com/en-us/research/publication/b-
bit-minwise-hashing/
• https://www.microsoft.com/en-us/research/publication/
theory-and-applications-of-b-bit-minwise-hashing/
SIMHASH
SIMHASH
• The SimHash (sign normal random projection) algorithm was
proposed by Moses S. Charikar in 2002.
• SimHash originates from the concept of sign random projections (SRP):
• In short, for a given vector x, SRP utilizes a random vector w with each component
generated from an i.i.d. normal distribution, i.e. wi ~ N(0,1), and stores only the sign of the
projected data.
• Currently, SimHash is the only known LSH for cosine similarity:

  sim(A,B) = (A · B) / (‖A‖ · ‖B‖) = Σi ai·bi / (√(Σi ai²) · √(Σi bi²))
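A sketch of sign random projections with NumPy (the seed and dimensionality are illustrative):

    import numpy as np

    def srp_fingerprint(x, f=64, seed=42):
        """f random Gaussian vectors w_i ~ N(0,1); bit i is the sign of w_i · x."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((f, len(x)))   # one random hyperplane per output bit
        return (W @ x >= 0).astype(int)

    # Similar vectors fall on the same side of most random hyperplanes,
    # so their fingerprints differ in few bits.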
SIMHASH
• In fact, it is a dimensionality reduction technique that maps high-
dimensional vectors to ƒ-bit fingerprints, where ƒ is small (e.g. 32 or 64).
• The SimHash algorithm considers a fingerprint as a “hash” of a document’s properties
and assumes that similar documents have similar hash values.
• This assumption requires the hash functions to be LSH and, as we already know, this
isn’t as trivial as it might seem at first.
• If we take 2 documents that differ in just a single byte and apply such
popular hash functions as SHA-1 or MD5, we get two completely
different hash values that aren’t close at all. This is why this approach requires a
special rule for defining hash functions with the required property.
• An attractive feature of the SimHash hash functions is that the output is
a single bit (the sign).
SIMHASH: FINGERPRINTS
• To generate an ƒ-bit fingerprint, maintain an ƒ-dimensional vector V (all
components are 0 at the beginning).
• A feature is hashed into an ƒ-bit hash value (using a hash function) and can be
seen as a binary vector of ƒ bits.
• These ƒ bits (unique to the feature) increment/decrement the ƒ components of
the vector by the weight of that feature as follows:
• if the i-th bit of the hash value is 1 => the i-th component of V is incremented by the
weight of that feature
• if the i-th bit of the hash value is 0 => the i-th component of V is decremented by the
weight of that feature.
• When all features have been processed, the components of V are positive or
negative.
• The signs of the components determine the corresponding bits of the final fingerprint
(see the sketch below).
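A compact Python sketch of this procedure; md5 truncated to ƒ bits stands in as the per-feature hash function:

    import hashlib

    def simhash(features, f=64):
        """features: iterable of (token, weight) pairs."""
        V = [0.0] * f
        for token, weight in features:
            h = int(hashlib.md5(token.encode("utf8")).hexdigest(), 16) & ((1 << f) - 1)
            for i in range(f):
                V[i] += weight if (h >> i) & 1 else -weight   # +w for a 1-bit, -w for a 0-bit
        fingerprint = 0
        for i in range(f):
            if V[i] > 0:            # sign of each component -> one fingerprint bit
                fingerprint |= 1 << i
        return fingerprint

    print(format(simhash([("ferret", 1.0), ("go", 0.9)]), "064b"))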
SIMHASH: FINGERPRINTS EXAMPLE
• Consider a document after POS tagging: {(“ferret”, NOUN), (“go”, VERB)}
• Decide to weight features by their POS tags: {“NOUN”: 1.0, “VERB”: 0.9}
• h(x) is some hash function; we will calculate a 16-bit (ƒ = 16) fingerprint F
• Maintain V = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) for the document
• h(“ferret”) = 0101010101010111, weight 1.0 (NOUN):
• h(“ferret”)[1] == 0 => V[1] = V[1] - 1.0 = -1.0
• h(“ferret”)[2] == 1 => V[2] = V[2] + 1.0 = 1.0
• …
• V = (-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0)
• h(“go”) = 0111011100100111, weight 0.9 (VERB):
• h(“go”)[1] == 0 => V[1] = V[1] - 0.9 = -1.9
• h(“go”)[2] == 1 => V[2] = V[2] + 0.9 = 1.9
• …
• V = (-1.9, 1.9, -0.1, 1.9, -1.9, 1.9, -0.1, 1.9, -1.9, 0.1, -0.1, 0.1, -1.9, 1.9, 1.9, 1.9)
• sign(V) = (-, +, -, +, -, +, -, +, -, +, -, +, -, +, +, +)
• Fingerprint F = 0101010101010111
SIMHASH: FINGERPRINTS
After converting documents into simhash fingerprints, we face the following
design problem:
• Given an ƒ-bit fingerprint of a document, how can we quickly discover other
fingerprints that differ in at most k bit-positions?
SIMHASH: DATA STRUCTURE
• The SimHash data structure consists of t tables: T1, T2, …, Tt. With each table
Ti, 2 quantities are associated:
• an integer pi (the number of top bit-positions used in comparisons)
• a permutation πi over the ƒ bit-positions
• Table Ti is constructed by applying the permutation πi to each existing
fingerprint F; the resulting set of permuted ƒ-bit fingerprints is sorted.
• To keep such a structure in main memory, it is possible to compress each
table (which decreases its size roughly by half).
• Example:
  DF = {0110, 0010, 0010, 1001, 1110}
  πi = {1→2, 2→4, 3→3, 4→1}
• πi(0110) = 0011, πi(0010) = 0010,
  πi(1001) = 1100, πi(1110) = 0111
• The sorted table:

       | 0 0 1 0 |
       | 0 0 1 1 |
  Ti = | 0 0 1 1 |
       | 0 1 1 1 |
       | 1 1 0 0 |
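A sketch reproducing this example; the left-to-right bit numbering is an assumption chosen to match the slide:

    def permute_bits(fp, pi, f=4):
        """Apply a bit-position permutation pi (old position -> new, 1-based)."""
        out = 0
        for old, new in pi.items():
            bit = (fp >> (f - old)) & 1        # bit positions counted left to right
            out |= bit << (f - new)
        return out

    pi = {1: 2, 2: 4, 3: 3, 4: 1}
    DF = [0b0110, 0b0010, 0b0010, 0b1001, 0b1110]
    Ti = sorted(permute_bits(fp, pi) for fp in DF)
    print([format(fp, "04b") for fp in Ti])    # ['0010', '0011', '0011', '0111', '1100']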
SIMHASH: ALGORITHM
• The algorithm follows the general approach we saw for LSH.
• Given a fingerprint F and an integer k, we probe the tables of the data structure in
parallel:
• For each table Ti, we identify all permuted fingerprints in Ti whose top pi bit-
positions match the top pi bit-positions of πi(F).
• Then, for each such permuted fingerprint, we check whether it differs from
πi(F) in at most k bit-positions (in fact, computing the Hamming distance
between the fingerprints).
• Identification of the first fingerprint in table Ti whose top pi bit-positions match
the top pi bit-positions of πi(F) (Step 1) can be done in O(pi) steps by binary
search.
• If we assume that each fingerprint is a truly random bit sequence, it is possible
to shrink the run time to O(log pi) steps in expectation (interpolation search).
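A sketch of probing one table, representing the sorted table Ti as a list of integers; the names and parameters are illustrative:

    import bisect

    def probe(Ti, p, permuted_F, k, f=64):
        """Binary-search the block whose top p bits match permuted_F,
        then keep fingerprints within Hamming distance k."""
        shift = f - p
        prefix = permuted_F >> shift
        lo = bisect.bisect_left(Ti, prefix << shift)      # first possible candidate
        matches = []
        for fp in Ti[lo:]:
            if fp >> shift != prefix:                     # left the matching block
                break
            if bin(fp ^ permuted_F).count("1") <= k:      # Hamming distance check
                matches.append(fp)
        return matches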
SIMHASH: PARAMETERS
• For a repository of 8B web pages, ƒ = 64 (64-bit simhash fingerprints)
and k = 3 are reasonable (Gurmeet Singh Manku et al., 2007).
• For the given ƒ and k, to find a reasonable combination of t (the number of
tables) and pi (the number of top bit-positions used in comparisons), we have
to note that they are not independent of each other:
• Increasing the number of tables t increases pi and hence reduces the query time
(large values of pi let us avoid checking too many fingerprints in Step 2).
• Decreasing the number of tables t reduces storage requirements, but reduces pi
and hence increases the query time (a small set of permutations avoids space
blowup).
SIMHASH: PARAMETERS
• Consider ƒ = 64 (64-bit fingerprints) and k = 3, so documents whose fingerprints
differ in at most 3 bit-positions are considered similar (duplicates).
• Assume we have 8B = 2^34 existing fingerprints and decide to use t = 10 tables:
• For instance, t = 10 = C(5,2), so we can use 5 blocks and choose 2 out of them in
10 ways.
• Split the 64 bits into 5 blocks having 13, 13, 13, 13 and 12 bits respectively.
• For each such choice, the permutation π corresponds to making the bits lying in the
chosen blocks the leading bits.
• pi is the total number of bits in the chosen blocks, thus pi = 25 or 26.
• On average, a probe retrieves at most 2^(34-25) = 512 (permuted) fingerprints.
• In general, if we seek all (permuted) fingerprints that match the top pi bits of a given
(permuted) fingerprint, we expect 2^(d-pi) fingerprints as matches.
• The total number of probes for 64-bit fingerprints and k = 3 is C(64,3) = 41664.
SIMHASH: PYTHON
• https://github.com/leonsunliang/simhash
  A Python implementation of SimHash.
• http://bibliographie-trac.ub.rub.de/browser/simhash.py
  Implementation of Charikar's simhashes in Python
SIMHASH: READ MORE
• http://www.cs.princeton.edu/courses/archive/spr04/
cos598B/bib/CharikarEstim.pdf
• http://jmlr.org/proceedings/papers/v33/
shrivastava14.pdf
• http://www.wwwconference.org/www2007/papers/
paper215.pdf
• http://ferd.ca/simhashing-hopefully-made-simple.html
THANK YOU
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
1 of 46

Recommended

Apache Ambari: Past, Present, Future by
Apache Ambari: Past, Present, FutureApache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureHortonworks
3.2K views97 slides
하둡 HDFS 훑어보기 by
하둡 HDFS 훑어보기하둡 HDFS 훑어보기
하둡 HDFS 훑어보기beom kyun choi
24.2K views20 slides
카프카 기반의 대규모 모니터링 플랫폼 개발이야기 by
카프카 기반의 대규모 모니터링 플랫폼 개발이야기카프카 기반의 대규모 모니터링 플랫폼 개발이야기
카프카 기반의 대규모 모니터링 플랫폼 개발이야기if kakao
2K views46 slides
Apache HBase™ by
Apache HBase™Apache HBase™
Apache HBase™Prashant Gupta
3.8K views58 slides
Benchmarking NGINX for Accuracy and Results by
Benchmarking NGINX for Accuracy and ResultsBenchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and ResultsNGINX, Inc.
2.8K views28 slides
하둡 맵리듀스 훑어보기 by
하둡 맵리듀스 훑어보기하둡 맵리듀스 훑어보기
하둡 맵리듀스 훑어보기beom kyun choi
11.4K views26 slides

More Related Content

What's hot

HBase Tales From the Trenches - Short stories about most common HBase operati... by
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
1.8K views18 slides
Everything you always wanted to know about Redis but were afraid to ask by
Everything you always wanted to know about Redis but were afraid to askEverything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to askCarlos Abalde
26.9K views73 slides
HDFS Namenode High Availability by
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High AvailabilityHortonworks
11.6K views18 slides
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ... by
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
637 views29 slides
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman by
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedis Labs
2.4K views26 slides
Hug Hbase Presentation. by
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.Jack Levin
9.5K views9 slides

What's hot(20)

HBase Tales From the Trenches - Short stories about most common HBase operati... by DataWorks Summit
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit1.8K views
Everything you always wanted to know about Redis but were afraid to ask by Carlos Abalde
Everything you always wanted to know about Redis but were afraid to askEverything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to ask
Carlos Abalde26.9K views
HDFS Namenode High Availability by Hortonworks
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
Hortonworks11.6K views
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ... by Simplilearn
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn637 views
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman by Redis Labs
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
Redis Labs2.4K views
Hug Hbase Presentation. by Jack Levin
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin9.5K views
automatic classification in information retrieval by Basma Gamal
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
Basma Gamal1.6K views
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T... by Simplilearn
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn2.6K views
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial... by CloudxLab
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab450 views
My first 90 days with ClickHouse.pdf by Alkin Tezuysal
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
Alkin Tezuysal501 views
MariaDB Performance Tuning Crash Course by Severalnines
MariaDB Performance Tuning Crash CourseMariaDB Performance Tuning Crash Course
MariaDB Performance Tuning Crash Course
Severalnines1.3K views
Indicator Random Variables by Daniel Raditya
Indicator Random VariablesIndicator Random Variables
Indicator Random Variables
Daniel Raditya1.1K views
Tuning Apache Ambari performance for Big Data at scale with 3000 agents by DataWorks Summit
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsTuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
DataWorks Summit2.2K views
Scalable and Available, Patterns for Success by Derek Collison
Scalable and Available, Patterns for SuccessScalable and Available, Patterns for Success
Scalable and Available, Patterns for Success
Derek Collison20.2K views
Introduction to Redis by Dvir Volk
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk121.1K views
WMTS Performance Tests by Roope Tervo
WMTS Performance TestsWMTS Performance Tests
WMTS Performance Tests
Roope Tervo1.3K views

Viewers also liked

Probabilistic data structures. Part 3. Frequency by
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyAndrii Gakhov
1.7K views31 slides
Probabilistic data structures. Part 2. Cardinality by
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
1.7K views26 slides
Bloom filter by
Bloom filterBloom filter
Bloom filterfeng lee
3.1K views20 slides
Bloom filters by
Bloom filtersBloom filters
Bloom filtersDevesh Maru
5.5K views23 slides
Data Mining - lecture 7 - 2014 by
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Andrii Gakhov
2.5K views26 slides
爬虫点滴 by
爬虫点滴爬虫点滴
爬虫点滴Open Party
1.4K views28 slides

Viewers also liked(20)

Probabilistic data structures. Part 3. Frequency by Andrii Gakhov
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
Andrii Gakhov1.7K views
Probabilistic data structures. Part 2. Cardinality by Andrii Gakhov
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
Andrii Gakhov1.7K views
Bloom filter by feng lee
Bloom filterBloom filter
Bloom filter
feng lee3.1K views
Bloom filters by Devesh Maru
Bloom filtersBloom filters
Bloom filters
Devesh Maru5.5K views
Data Mining - lecture 7 - 2014 by Andrii Gakhov
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
Andrii Gakhov2.5K views
爬虫点滴 by Open Party
爬虫点滴爬虫点滴
爬虫点滴
Open Party1.4K views
Вероятностные структуры данных by Andrii Gakhov
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
Andrii Gakhov1.3K views
Big Data with Semantics - StampedeCon 2012 by StampedeCon
Big Data with Semantics - StampedeCon 2012Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012
StampedeCon1.8K views
Thinking in MapReduce - StampedeCon 2013 by StampedeCon
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013
StampedeCon1.7K views
[Vldb 2013] skyline operator on anti correlated distributions by WooSung Choi
[Vldb 2013] skyline operator on anti correlated distributions[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions
WooSung Choi490 views
Learning a nonlinear embedding by preserving class neibourhood structure 최종 by WooSung Choi
Learning a nonlinear embedding by preserving class neibourhood structure   최종Learning a nonlinear embedding by preserving class neibourhood structure   최종
Learning a nonlinear embedding by preserving class neibourhood structure 최종
WooSung Choi298 views
Dsh data sensitive hashing for high dimensional k-nn search by WooSung Choi
Dsh  data sensitive hashing for high dimensional k-nn searchDsh  data sensitive hashing for high dimensional k-nn search
Dsh data sensitive hashing for high dimensional k-nn search
WooSung Choi354 views
Economic Development and Satellite Images by Paul Raschky
Economic Development and Satellite ImagesEconomic Development and Satellite Images
Economic Development and Satellite Images
Paul Raschky539 views
An optimal and progressive algorithm for skyline queries slide by WooSung Choi
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
WooSung Choi686 views
Implementing a Fileserver with Nginx and Lua by Andrii Gakhov
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
Andrii Gakhov2.1K views
Bloom filter by wang ping
Bloom filterBloom filter
Bloom filter
wang ping2.9K views
skip list by iammutex
skip listskip list
skip list
iammutex9.4K views

Similar to Probabilistic data structures. Part 4. Similarity

Local sensitive hashing & minhash on facebook friend by
Local sensitive hashing & minhash on facebook friendLocal sensitive hashing & minhash on facebook friend
Local sensitive hashing & minhash on facebook friendChengeng Ma
836 views30 slides
Hashing And Hashing Tables by
Hashing And Hashing TablesHashing And Hashing Tables
Hashing And Hashing TablesChinmaya M. N
43 views19 slides
Hashing by
HashingHashing
HashingAmar Jukuntla
17.6K views22 slides
Building graphs to discover information by David Martínez at Big Data Spain 2015 by
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Big Data Spain
729 views32 slides
Open LSH - september 2014 update by
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
1.4K views22 slides
Mining of massive datasets using locality sensitive hashing (LSH) by
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
8.5K views15 slides

Similar to Probabilistic data structures. Part 4. Similarity(20)

Local sensitive hashing & minhash on facebook friend by Chengeng Ma
Local sensitive hashing & minhash on facebook friendLocal sensitive hashing & minhash on facebook friend
Local sensitive hashing & minhash on facebook friend
Chengeng Ma836 views
Building graphs to discover information by David Martínez at Big Data Spain 2015 by Big Data Spain
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
Big Data Spain729 views
Open LSH - september 2014 update by J Singh
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
J Singh1.4K views
Mining of massive datasets using locality sensitive hashing (LSH) by J Singh
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
J Singh8.5K views
introduction to trees,graphs,hashing by Akhil Prem
introduction to trees,graphs,hashingintroduction to trees,graphs,hashing
introduction to trees,graphs,hashing
Akhil Prem216 views
Data Step Hash Object vs SQL Join by Geoff Ness
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL Join
Geoff Ness795 views
Algorithms for Query Processing and Optimization of Spatial Operations by Natasha Mandal
Algorithms for Query Processing and Optimization of Spatial OperationsAlgorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial Operations
Natasha Mandal845 views
Segmentation - based Historical Handwritten Word Spotting using document-spec... by Konstantinos Zagoris
Segmentation - based Historical Handwritten Word Spotting using document-spec...Segmentation - based Historical Handwritten Word Spotting using document-spec...
Segmentation - based Historical Handwritten Word Spotting using document-spec...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto... by DECK36
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
DECK366.1K views
Conceptos básicos. Seminario web 1: Introducción a NoSQL by MongoDB
Conceptos básicos. Seminario web 1: Introducción a NoSQLConceptos básicos. Seminario web 1: Introducción a NoSQL
Conceptos básicos. Seminario web 1: Introducción a NoSQL
MongoDB10.7K views
Chapter 4.pptx by Tekle12
Chapter 4.pptxChapter 4.pptx
Chapter 4.pptx
Tekle1215 views
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende... by AIST
Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende...
AIST496 views
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی by Ehsan Asgarian
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکیDeep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Ehsan Asgarian355 views
Sharing a Startup’s Big Data Lessons by George Stathis
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis600 views

More from Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture by
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureAndrii Gakhov
423 views90 slides
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat... by
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
735 views39 slides
Too Much Data? - Just Sample, Just Hash, ... by
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Andrii Gakhov
386 views23 slides
DNS Delegation by
DNS DelegationDNS Delegation
DNS DelegationAndrii Gakhov
902 views15 slides
Pecha Kucha: Ukrainian Food Traditions by
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsAndrii Gakhov
839 views20 slides
Recurrent Neural Networks. Part 1: Theory by
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
14.2K views36 slides

More from Andrii Gakhov(20)

Let's start GraphQL: structure, behavior, and architecture by Andrii Gakhov
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
Andrii Gakhov423 views
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat... by Andrii Gakhov
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Andrii Gakhov735 views
Too Much Data? - Just Sample, Just Hash, ... by Andrii Gakhov
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
Andrii Gakhov386 views
Pecha Kucha: Ukrainian Food Traditions by Andrii Gakhov
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
Andrii Gakhov839 views
Recurrent Neural Networks. Part 1: Theory by Andrii Gakhov
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
Andrii Gakhov14.2K views
Apache Big Data Europe 2015: Selected Talks by Andrii Gakhov
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
Andrii Gakhov716 views
Swagger / Quick Start Guide by Andrii Gakhov
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
Andrii Gakhov7.6K views
API Days Berlin highlights by Andrii Gakhov
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
Andrii Gakhov787 views
ELK - What's new and showcases by Andrii Gakhov
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
Andrii Gakhov938 views
Apache Spark Overview @ ferret by Andrii Gakhov
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov1.2K views
Data Mining - lecture 8 - 2014 by Andrii Gakhov
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
Andrii Gakhov1.3K views
Data Mining - lecture 6 - 2014 by Andrii Gakhov
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
Andrii Gakhov1.1K views
Data Mining - lecture 5 - 2014 by Andrii Gakhov
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
Andrii Gakhov816 views
Data Mining - lecture 4 - 2014 by Andrii Gakhov
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
Andrii Gakhov1K views
Data Mining - lecture 3 - 2014 by Andrii Gakhov
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
Andrii Gakhov1K views
Decision Theory - lecture 1 (introduction) by Andrii Gakhov
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)
Andrii Gakhov1.4K views
Data Mining - lecture 2 - 2014 by Andrii Gakhov
Data Mining - lecture 2 - 2014Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014
Andrii Gakhov824 views
Data Mining - lecture 1 - 2014 by Andrii Gakhov
Data Mining - lecture 1 - 2014Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014
Andrii Gakhov2.1K views
Buzzwords 2014 / Overview / part2 by Andrii Gakhov
Buzzwords 2014 / Overview / part2Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2
Andrii Gakhov634 views

Recently uploaded

Future of AR - Facebook Presentation by
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
64 views27 slides
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...ShapeBlue
198 views20 slides
Business Analyst Series 2023 - Week 4 Session 7 by
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7DianaGray10
139 views31 slides
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueShapeBlue
263 views23 slides
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueShapeBlue
138 views15 slides
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlueShapeBlue
147 views23 slides

Recently uploaded(20)

Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty64 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue198 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10139 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue263 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue138 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue147 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue161 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue173 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker54 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE79 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue123 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue238 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue206 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue186 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue297 views

Probabilistic data structures. Part 4. Similarity

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 4: SIMILARITY
  • 3. • To find clusters of similar documents from the document set • To find duplicates of the document in the document set THE PROBLEM
  • 5. LOCALITY SENTSITIVE HASHING • Locality Sentsitive Hashing (LSH) was introduced by Indyk and Motwani in 1998 as a family of functions with the following property: • similar input objects (from the domain of such functions) have a higher probability of colliding in the range space than non-similar ones • LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items. • Intuitively, LSH is based on the simple idea that, if two points are close together, then after a “projection” operation these two points will remain close together.
  • 6. LSH: PROPERTIES • One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.
  • 7. LSH: PROPERTIES • Examples of LSH functions are known for resemblance, cosine and hamming distances. • LSH functions do not exist for certain commonly used similarity measures in information retrieval, the Dice coefficient and the Overlap coefficient (Moses S. Charikar, 2002)
  • 8. LSH: PROBLEM • Given a query point, we wish to find the points in a large database that are closest to the query. We wish to guarantee with a high probability equal to 1 − δ that we return the nearest neighbor for any query point. • Conceptually, this problem is easily solved by iterating through each point in the database and calculating the distance to the query object. However, our database may contain billions of objects — each object described by a vector that contains hundreds of dimensions. • LSH proves useful to identify nearest neighbors quickly even when the database is very large.
  • 9. LSH: FINDING DUPLICATE PAGES ON THE WEB How to select features for the webpage? • document vector from page content • Compute “document-vector” by case-folding, stop-word removal, stemming, computing term-frequencies and finally, weighting each term by its inverse document frequency (IDF) • shingles from page content • Consider a document as an sequence of words.A shingle is a hash value of a k-gram which is sub-sequence of k consequent words. • Hashes for shingles can be computed using Rabin’s fingerprints • connectivity information • The idea that similar web pages have several incoming links in common. • anchor text and anchor window • The idea that similar documents should have similar anchor text.The words in the anchor text or anchor window are folded into the document-vector itself. • phrases • Idea is to identify phrases and compute a document-vector that includes phrases as terms.
  • 10. LSH: FINDING DUPLICATE PAGES ON THE WEB • The Web contains many duplicate pages, partly because content is duplicated across sites and partly because there is more than one URL that points to the same page. • If we use shingles as features, each shingle represents a portion of a Web page and is computed by forming a histogram of the words found within that portion of the page. • We can test to see if a portion of the page is duplicated elsewhere on the Web by looking for other shingles with the same histogram. If the shingles of the new page match shingles from the database, then it is likely that the new page bears a strong resemblance to an existing page. • Because Web pages are surrounded by navigational and other information that changes from site to site it is useful to use nearest-neighbor solution that can employ LSH functions.
  • 11. LSH: RETRIEVING IMAGES • LSH can be used in image retrieval as an object recognition tool. We compute a detailed metric for many different orientations and configurations of an object we wish to recognize. • Then, given a new image we simply check our database to see if a precomputed object’s metrics are close to our query. This database contains millions of poses and LSH allows us to quickly check if the query object is known.
  • 12. LSH: RETRIEVING MUSIC • In music retrieval conventional hashes and robust features are typically used to find musical matches. • The features can be fingerprints, i.e., representations of the audio signal that are robust to common types of abuse that are performed to audio before it reaches our ears. • Fingerprints can be computed, for instance, by noting the peaks in the spectrum (because they are robust to noise) and encoding their position in time and space. User then just has to query the database for the same fingerprint. • However, to find similar songs we cannot use fingerprints because these are different when a song is remixed for a new audience or when a different artist performs the same song. Instead, we can use several seconds of the song— a snippet—as a shingle. • To determine if two songs are similar, we need to query the database and see if a large enough number of the query shingles are close to one song in the database. • Although closeness depends on the feature vector, we know that long shingles provide specificity. This is particularly important because we can eliminate duplicates to improve search results and to link recommendation data between similar songs.
  • 13. LSH: PYTHON • https://github.com/ekzhu/datasketch
 datasketch gives you probabilistic data structures that can process vary large amount of data • https://github.com/simonemainardi/LSHash
 LSHash is a fast Python implementation of locality sensitive hashing with persistence support • https://github.com/go2starr/lshhdc
 LSHHDC: Locality-Sensitive Hashing based High Dimensional Clustering
  • 14. LSH: READ MORE • http://www.cs.princeton.edu/courses/archive/spr04/cos598B/ bib/IndykM-curse.pdf • http://infolab.stanford.edu/~ullman/mmds/book.pdf • http://www.cs.princeton.edu/courses/archive/spr04/cos598B/ bib/CharikarEstim.pdf • http://www.matlabi.ir/wp-content/uploads/bank_papers/ g_paper/g15_Matlabi.ir_Locality- Sensitive%20Hashing%20for%20Finding%20Nearest%20Nei ghbors.pdf • https://en.wikipedia.org/wiki/Locality-sensitive_hashing
  • 16. MINHASH: RESEMBLANCE SIMILARITY • While talking about relations of two documents A and B, we mostly interesting in such concept as "roughly the same", that could be mathematically formalized as their resemblance (or Jaccard similarity). • Every document can be seen as a set of features and such issue could be reduced to a set intersection problem (find similarity between sets). • The resemblance r(A,B) of two documents is a number between 0 and 1, such that it is close to 1 for the documents that are "roughly the same”: • For duplicate detection problem we can reformulate is as a task to find pairs of documents whose resemblance r(A,B) exceeds certain threshold R. • For the clustering problem we can use d(A,B) = 1 - r(A,B) as a metric to design algorithms intended to cluster a collection of documents into sets of closely resembling documents. J(A,B) = r(A,B) = A∩ B A∪ B
  • 17. MINHASH • MinHash (minwise hashing algorithm) has been proposed by Andrei Broder in 1997 • The minwise hashing family applies a random permutation π: Ω → Ω, on the given set S, and stores only the minimum values after the permutation mapping. • Formally MinHash is defined as: hπ min S( )= min π S( )( )= min π1 S( ),…πk S( )( )
  • 18. MINHASH: MATRIX REPRESENTATION • Collection of sets could be visualised as characteristic matrix CM. • The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn. • CM(r, c) =1 if the element for row r is a member of the set for column c, otherwise CM(r,c) =0 • NOTE: The characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. • For one reason not to store data as a matrix, these matrices are almost always sparse (they have many more 0’s than 1’s) in practice.
  • 19. MINHASH: MATRIX REPRESENTATION • Consider the following sets documents (represented as bag of words): • S1 = {python, is, a, programming, language} • S2 = {java, is, a, programming, language} • S3 = {a, programming, language} • S4 = {python, is, a, snake} • We can assume that our universal set limits to words from these 4 documents.The characteristic matrix of these sets is row S1 S2 S3 S4 a 0 1 1 1 1 is 1 1 1 0 1 java 2 0 1 0 0 language 3 1 1 1 0 programming 4 1 1 1 0 python 5 1 0 0 1 snake 6 0 0 0 1
• 20. MINHASH: INTUITION
• To minhash a set represented by a column of the characteristic matrix, we pick a permutation of the rows.
• The minhash value of a column is then the number of the first row, in the permuted order, in which that column has a 1.
• 21. MINHASH: ONE-STEP EXAMPLE
• Consider the characteristic matrix below. Its rows are listed in the permuted order; the last column records each row's number in the original matrix:

position  S1  S2  S3  S4  original row
1         1   0   1   0   2
2         1   0   0   0   5
3         0   0   1   1   6
4         1   0   0   1   1
5         0   0   1   1   4
6         0   1   1   0   3

• Record the original number of the first row with a "1" in each column: m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6
• Estimate the Jaccard similarity from this single permutation: J(Si,Sj) = 1 if m(Si) = m(Sj), and 0 otherwise, so J(S1,S3) = 1 and J(S1,S4) = 0
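A single permutation gives only a 0/1 estimate; averaging over many random permutations converges to the true Jaccard similarity. Below is a small illustrative sketch using explicit permutations, fine for toy data but infeasible at scale, as the next slide explains; the sets and names are from the running example.

```python
import random

def estimate_jaccard(universe, a, b, trials=1000, seed=42):
    """Estimate J(a, b) as the fraction of random permutations
    under which both sets have the same minhash value."""
    rng = random.Random(seed)
    rows = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(rows)  # one random permutation, shared by both sets
        min_a = next(r for r in rows if r in a)  # first row present in a
        min_b = next(r for r in rows if r in b)  # first row present in b
        hits += (min_a == min_b)
    return hits / trials

s1 = {"python", "is", "a", "programming", "language"}
s3 = {"a", "programming", "language"}
universe = s1 | s3 | {"java", "snake"}
print(estimate_jaccard(universe, s1, s3))  # converges to J(S1,S3) = 3/5 = 0.6
```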
• 22. MINHASH: INTUITION
• It is not feasible to permute a large characteristic matrix explicitly.
• Even picking a random permutation of millions of rows is time-consuming, and the necessary sorting of the rows would take even more time.
• Fortunately, it is possible to simulate the effect of a random permutation with a random hash function that maps row numbers to as many buckets as there are rows.
• So, instead of picking n random permutations of the rows, we pick n randomly chosen hash functions h1, h2, …, hn on the rows.
• 23. MINHASH: PROPERTIES
• The connection between minhash and the resemblance (Jaccard) similarity of the sets that are minhashed:
• The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
• The minhash(π) of a set is the number of the row (element) with the first non-zero value in the permuted order π.
• MinHash is an LSH for resemblance (Jaccard) similarity, which is defined over binary vectors.
• 24. MINHASH: ALGORITHM
• Pick k randomly chosen hash functions h1, h2, …, hk.
• Build the characteristic matrix CM for the collection of sets.
• From the column representing set S we will construct the minhash signature for S: the vector [h1(S), h2(S), …, hk(S)].
• Construct a signature matrix SIG, where SIG(i, c) corresponds to the i-th hash function and column c. Initially, set SIG(i, c) = ∞ for every i and c.
• For each row r:
• Compute h1(r), h2(r), …, hk(r)
• For each column c:
• If CM(r, c) = 1, then set SIG(i, c) = min{SIG(i, c), hi(r)} for each i = 1…k
• Estimate the resemblance (Jaccard) similarity of the underlying sets from the final signature matrix as the fraction of rows in which two columns agree, as in the sketch below.
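The algorithm above translates almost line for line into Python. This is a minimal sketch; the universal hashing scheme h(x) = (a·x + b) mod p is a common choice we assume here, not something mandated by the slides, and all names are ours.

```python
import random

def minhash_signatures(sets, universe, k=100, seed=1):
    """Compute a k-row minhash signature matrix for a list of sets.

    Rows of the implicit characteristic matrix are the elements of
    `universe`; each hash function simulates one random permutation.
    """
    rng = random.Random(seed)
    rows = {element: i for i, element in enumerate(sorted(universe))}
    p = 2_147_483_647  # a prime larger than the number of rows
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(k)]

    sig = [[float("inf")] * len(sets) for _ in range(k)]
    for element, r in rows.items():
        hashes = [(a * r + b) % p for a, b in funcs]
        for c, s in enumerate(sets):
            if element in s:  # CM(r, c) = 1
                for i in range(k):
                    sig[i][c] = min(sig[i][c], hashes[i])
    return sig

def signature_similarity(sig, c1, c2):
    """Estimated Jaccard similarity: fraction of rows where columns agree."""
    agree = sum(1 for row in sig if row[c1] == row[c2])
    return agree / len(sig)

s1 = {"python", "is", "a", "programming", "language"}
s3 = {"a", "programming", "language"}
sig = minhash_signatures([s1, s3], s1 | s3 | {"java", "snake"}, k=200)
print(signature_similarity(sig, 0, 1))  # should be close to 0.6
```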
• 25. MINHASH: EXAMPLE
• Consider the same collection of 4 documents and a universal set that consists of 7 words.
• Pick 2 hash functions:
• h1(x) = (x + 1) mod 7
• h2(x) = (3x + 1) mod 7

element       row  S1  S2  S3  S4  h1(row)  h2(row)
a             0    1   1   1   1   1        1
is            1    1   1   0   1   2        4
java          2    0   1   0   0   3        0
language      3    1   1   1   0   4        3
programming   4    1   1   1   0   5        6
python        5    1   0   0   1   6        2
snake         6    0   0   0   1   0        5
• 26. MINHASH: EXAMPLE
• Initially the signature matrix contains only ∞:

     S1  S2  S3  S4
h1   ∞   ∞   ∞   ∞
h2   ∞   ∞   ∞   ∞

• First, consider row 0: h1(0) and h2(0) are both 1. All four sets have a 1 in row 0, so in every column we set the value min{∞, 1} = 1:

     S1  S2  S3  S4
h1   1   1   1   1
h2   1   1   1   1

• Next, consider row 1: h1(1) = 2 and h2(1) = 4. In this row only S1, S2 and S4 have 1's, so we set the appropriate cells to the minimum of the existing value and 2 for h1 (or 4 for h2). Nothing changes, since the existing 1's are already smaller:

     S1  S2  S3  S4
h1   1   1   1   1
h2   1   1   1   1
• 27. MINHASH: EXAMPLE
• Continuing with the other rows, after row 5 the signature matrix is:

     S1  S2  S3  S4
h1   1   1   1   1
h2   1   0   1   1

• Finally, consider row 6: h1(6) = 0 and h2(6) = 5. In this row only S4 has a 1, so we set its cells to the minimum of the existing value and 0 for h1 (or 5 for h2). The final signature matrix is:

     S1  S2  S3  S4
h1   1   1   1   0
h2   1   0   1   1

• Documents S1 and S3 have identical signature columns, so the estimated similarity is 1 (the true value is J = 3/5 = 0.6)
• Documents S2 and S4 have no signature rows in common, so the estimated similarity is 0 (the true value is J = 2/7 ≈ 0.28)
• 28. MINHASH: PYTHON
• https://github.com/ekzhu/datasketch
datasketch gives you probabilistic data structures that can process very large amounts of data
• https://github.com/anthonygarvan/MinHash
MinHash is an effective, pure-Python implementation of MinHash
• 29. MINHASH: READ MORE
• http://infolab.stanford.edu/~ullman/mmds/book.pdf
• http://www.cs.cmu.edu/~guyb/realworld/slidesS13/minhash.pdf
• https://en.wikipedia.org/wiki/MinHash
• http://www2007.org/papers/paper570.pdf
• https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
• 31. B-BIT MINHASH
• A modification of minwise hashing, called b-bit minwise hashing, was proposed by Ping Li and Arnd Christian König in 2010.
• The method provides a simple solution: store only the lowest b bits of each hashed value, reducing the dimensionality of the (expanded) hashed data matrix to just 2^b · k.
• b-bit minwise hashing is simple and requires only minimal modification to the original minwise hashing algorithm.
• 32. B-BIT MINHASH
• Intuitively, using fewer bits per sample increases the estimation variance, compared to the original MinHash, at the same "sample size" k. Thus we have to increase k to maintain the same accuracy.
• Interestingly, the theoretical results demonstrate that when the resemblance is not too small (e.g. R ≥ 0.5, a common threshold), we do not have to increase k much.
• This means that, compared to the earlier approach, b-bit minwise hashing can improve estimation accuracy and significantly reduce the storage requirements at the same time.
• For example, for b = 1 and R = 0.5 the estimation variance increases at most by a factor of 3, so in order not to lose accuracy we have to increase the sample size k by a factor of 3.
• If we originally stored each hashed value using 64 bits, the overall storage improvement for b = 1 is therefore 64/(1 · 3) ≈ 21.3, as the sketch below illustrates.
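A minimal sketch of the core trick: keep only the lowest b bits of each minhash value and compare those. The helper names are ours, and the estimator below is the naive matching fraction, not the bias-corrected estimator derived in the paper.

```python
def lowest_b_bits(value: int, b: int = 1) -> int:
    """Keep only the lowest b bits of a hashed value (the b-bit sample)."""
    return value & ((1 << b) - 1)

def matched_fraction(sig_a, sig_b, b=1):
    """Fraction of b-bit samples that agree between two signatures."""
    matches = sum(
        lowest_b_bits(x, b) == lowest_b_bits(y, b)
        for x, y in zip(sig_a, sig_b)
    )
    return matches / len(sig_a)

# With b = 1, even two unrelated sets agree on about 50% of the samples by
# pure chance; this is exactly why the paper derives a corrected estimator
# of R from this raw matching fraction.
```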
• 33. B-BIT MINHASH: READ MORE
• https://www.microsoft.com/en-us/research/publication/b-bit-minwise-hashing/
• https://www.microsoft.com/en-us/research/publication/theory-and-applications-of-b-bit-minwise-hashing/
• 35. SIMHASH
• The SimHash (a sign normal random projection) algorithm was proposed by Moses S. Charikar in 2002.
• SimHash originates from the concept of sign random projections (SRP):
• In short, for a given vector x, SRP utilizes a random vector w with each component generated from an i.i.d. normal distribution, i.e. wi ~ N(0,1), and stores only the sign of the projected data: h(x) = sign(w · x).
• Currently, SimHash is the only known LSH for cosine similarity:

sim(A,B) = (A · B) / (‖A‖ · ‖B‖) = Σᵢ aᵢbᵢ / ( √(Σᵢ aᵢ²) · √(Σᵢ bᵢ²) )
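A minimal numpy sketch of sign random projections, assuming numpy is available: two vectors with a small angle between them agree on most signs, since P[h(x) = h(y)] = 1 − θ(x,y)/π, which lets us recover an estimate of the cosine similarity from the agreement rate.

```python
import numpy as np

def srp_signature(x, planes):
    """Sign of the projection of x onto each random hyperplane."""
    return (planes @ x) >= 0  # boolean vector: one bit per projection

rng = np.random.default_rng(0)
dim, bits = 100, 256
planes = rng.standard_normal((bits, dim))  # w_i ~ N(0, 1), i.i.d.

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)  # a slightly perturbed copy of x

sx, sy = srp_signature(x, planes), srp_signature(y, planes)
agreement = np.mean(sx == sy)
# invert P[agree] = 1 - theta/pi to estimate cos(theta)
print(agreement, np.cos(np.pi * (1 - agreement)))
```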
• 36. SIMHASH
• In fact, it is a dimensionality-reduction technique that maps high-dimensional vectors to ƒ-bit fingerprints, where ƒ is small (e.g. 32 or 64).
• The SimHash algorithm considers a fingerprint as a "hash" of a document's properties and assumes that similar documents have similar hash values.
• This assumption requires the hash functions to be LSH which, as we already know, is less trivial than it might seem at first.
• If we take 2 documents that differ in just a single byte and apply popular hash functions such as SHA-1 or MD5, we will definitely get two completely different hash values that are not close at all. This is why the approach requires a special rule for defining hash functions with the required property.
• An attractive feature of the SimHash hash functions is that the output is a single bit (the sign).
• 37. SIMHASH: FINGERPRINTS
• To generate an ƒ-bit fingerprint, maintain an ƒ-dimensional vector V (with all dimensions 0 at the beginning).
• Each feature is hashed into an ƒ-bit hash value (using a conventional hash function) and can be seen as a binary vector of ƒ bits.
• These ƒ bits (unique to the feature) increment or decrement the ƒ components of the vector by the weight of that feature as follows:
• if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature
• if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature
• When all features have been processed, the components of V are positive or negative.
• The signs of the components determine the corresponding bits of the final fingerprint.
• 38. SIMHASH: FINGERPRINTS EXAMPLE
• Consider a document after POS tagging: {("ferret", NOUN), ("go", VERB)}
• We decide to weight features by their POS tags: {"NOUN": 1.0, "VERB": 0.9}
• h(x) is some conventional hash function, and we calculate a 16-bit (ƒ = 16) fingerprint F
• Maintain V = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) for the document
• h("ferret") = 0101010101010111
• h("ferret")[1] == 0 => V[1] = V[1] - 1.0 = -1.0
• h("ferret")[2] == 1 => V[2] = V[2] + 1.0 = 1.0
• …
• V = (-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0)
• h("go") = 0111011100100111
• h("go")[1] == 0 => V[1] = V[1] - 0.9 = -1.9
• h("go")[2] == 1 => V[2] = V[2] + 0.9 = 1.9
• …
• V = (-1.9, 1.9, -0.1, 1.9, -1.9, 1.9, -0.1, 1.9, -1.9, 0.1, -0.1, 0.1, -1.9, 1.9, 1.9, 1.9)
• sign(V) = (-, +, -, +, -, +, -, +, -, +, -, +, -, +, +, +)
• fingerprint F = 0101010101010111
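The walk-through above maps directly to code. Here is a minimal sketch: we use Python's built-in hashlib (md5) purely as a convenient stand-in for the unspecified feature hash h(x), the weights mirror the example, and the bit order is least-significant-first rather than the slide's left-to-right notation.

```python
import hashlib

def feature_hash(feature: str, f: int = 16) -> int:
    """Stand-in for h(x): the lowest f bits of an MD5 digest."""
    digest = hashlib.md5(feature.encode("utf8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)

def simhash(weighted_features, f: int = 16) -> int:
    """Compute an f-bit SimHash fingerprint from (feature, weight) pairs."""
    v = [0.0] * f
    for feature, weight in weighted_features:
        h = feature_hash(feature, f)
        for i in range(f):
            if (h >> i) & 1:   # i-th bit of the hash is 1 -> increment
                v[i] += weight
            else:              # i-th bit of the hash is 0 -> decrement
                v[i] -= weight
    # the sign of each component determines the corresponding fingerprint bit
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

doc = [("ferret", 1.0), ("go", 0.9)]  # weights: NOUN = 1.0, VERB = 0.9
print(format(simhash(doc), "016b"))
```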
• 39. SIMHASH: FINGERPRINTS
• After converting documents into simhash fingerprints, we face the following design problem:
• Given an ƒ-bit fingerprint of a document, how do we quickly discover other fingerprints that differ in at most k bit-positions?
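The naive answer is a Hamming-distance scan; the couple of lines below (helper names are ours) make the problem concrete, for contrast with the table-based structure on the next slide, which avoids comparing against every stored fingerprint.

```python
def hamming_distance(f1: int, f2: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(f1 ^ f2).count("1")

def near_duplicates(query: int, fingerprints, k: int = 3):
    """Linear scan: fine for small collections, too slow for billions."""
    return [f for f in fingerprints if hamming_distance(query, f) <= k]
```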
• 40. SIMHASH: DATA STRUCTURE
• The SimHash data structure consists of t tables: T1, T2, …, Tt. Each table Ti has two associated quantities:
• an integer pi (the number of top bit-positions used in comparisons)
• a permutation πi over the ƒ bit-positions
• Table Ti is constructed by applying permutation πi to each existing fingerprint F, and the resulting set of permuted ƒ-bit fingerprints is sorted.
• To make the structure fit in main memory, each table can be compressed (which decreases its size by approximately half).
• Example: DF = {0110, 0010, 0010, 1001, 1110}, πi = {1→2, 2→4, 3→3, 4→1}
• πi(0110) = 0011, πi(0010) = 0010, πi(1001) = 1100, πi(1110) = 0111
• Ti is the sorted list of the permuted fingerprints:

0010
0010
0011
0111
1100
• 41. SIMHASH: ALGORITHM
• The algorithm follows the general approach we saw for LSH.
• Given a fingerprint F and an integer k, we probe the tables of the data structure in parallel:
• Step 1: for each table Ti, identify all permuted fingerprints in Ti whose top pi bit-positions match the top pi bit-positions of πi(F).
• Step 2: for each such permuted fingerprint, check whether it differs from πi(F) in at most k bit-positions (in fact, compare the Hamming distance between the fingerprints).
• Identifying the first fingerprint in table Ti whose top pi bit-positions match the top pi bit-positions of πi(F) (Step 1) can be done in O(pi) steps by binary search.
• If we assume that each fingerprint is a truly random bit sequence, interpolation search shrinks the expected running time to O(log pi) steps.
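A sketch of one table probe over a sorted list of integer fingerprints, using Python's bisect for the binary search of Step 1; the names and the representation of permuted fingerprints as plain ints are our assumptions.

```python
from bisect import bisect_left, bisect_right

def probe_table(table, query, p, f=64, k=3):
    """One SimHash table probe.

    `table` is a sorted list of (already permuted) f-bit fingerprints and
    `query` is the permuted query fingerprint pi(F).
    Step 1: binary-search the contiguous run sharing the top p bits.
    Step 2: filter that run by Hamming distance <= k.
    """
    prefix = query >> (f - p)
    lo = bisect_left(table, prefix << (f - p))
    hi = bisect_right(table, ((prefix + 1) << (f - p)) - 1)
    return [
        fp for fp in table[lo:hi]
        if bin(fp ^ query).count("1") <= k
    ]
```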
• 42. SIMHASH: PARAMETERS
• For a repository of 8B web pages, ƒ = 64 (64-bit simhash fingerprints) and k = 3 are reasonable (Gurmeet Singh Manku et al., 2007).
• For given ƒ and k, to find a reasonable combination of t (the number of tables) and pi (the number of top bit-positions used in comparisons), note that they are not independent of each other:
• Increasing the number of tables t increases pi and hence reduces the query time (large values of pi avoid checking too many fingerprints in Step 2).
• Decreasing the number of tables t reduces the storage requirements, but also reduces pi and hence increases the query time (a small set of permutations avoids space blowup).
• 43. SIMHASH: PARAMETERS
• Consider ƒ = 64 (64-bit fingerprints) and k = 3, so documents whose fingerprints differ in at most 3 bit-positions are considered similar (duplicates).
• Assume we have 8B = 2^34 existing fingerprints and decide to use t = 10 tables:
• For instance, t = 10 = C(5,2), so we can split the bits into 5 blocks and choose 2 of them in 10 ways.
• Split the 64 bits into 5 blocks of 13, 13, 13, 13 and 12 bits respectively.
• For each such choice, the permutation π makes the bits of the chosen blocks the leading bits.
• pi is the total number of bits in the chosen blocks, thus pi = 25 or 26.
• On average, a probe retrieves at most 2^(34−25) = 512 (permuted) fingerprints:
• if we seek all (permuted) fingerprints that match the top pi bits of a given (permuted) fingerprint, we expect 2^(d−pi) matches, where 2^d is the number of stored fingerprints (here d = 34).
• Without such tables, the total number of probes for 64-bit fingerprints and k = 3 would be C(64,3) = 41664.
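The parameter arithmetic above is easy to sanity-check in Python:

```python
from math import comb

print(comb(5, 2))      # 10 ways to choose 2 of 5 blocks -> t = 10 tables
print(2 ** (34 - 25))  # 512 expected candidates per probe (2^(d - p))
print(comb(64, 3))     # 41664 ways to choose 3 of 64 bit positions to flip
```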
• 44. SIMHASH: PYTHON
• https://github.com/leonsunliang/simhash
A Python implementation of SimHash.
• http://bibliographie-trac.ub.rub.de/browser/simhash.py
An implementation of Charikar's simhashes in Python.
• 45. SIMHASH: READ MORE
• http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
• http://jmlr.org/proceedings/papers/v33/shrivastava14.pdf
• http://www.wwwconference.org/www2007/papers/paper215.pdf
• http://ferd.ca/simhashing-hopefully-made-simple.html
• 46. THANK YOU
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com