Search4similars

What do you mean by similar?
■ Jaccard distance
■ Cosine distance
■ Lots of others
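A minimal sketch of the two distances named above, using their standard definitions. The function names `jaccard_distance` and `cosine_distance` are illustrative and do not come from the slides.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance between two collections treated as sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen here: two empty sets are considered identical.
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cosine_distance(u, v):
    """Cosine distance between two equal-length numeric vectors: 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

For example, `jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})` is 0.5, because the sets share 2 of 4 distinct elements, and `cosine_distance([1, 0], [0, 1])` is 1.0, because the vectors are orthogonal.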

Deduplication / Plagiarism 
LSH
[Figure: pairwise comparison matrix for objects A to G]
All you need is to compare each object with all the others: O(n²).
Your cap: compare only similar items.
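One common way to compare only likely-similar items is MinHash signatures with banding. The sketch below is an illustration under assumptions, not the speaker's implementation: character 3-gram shingles, 20 hash functions, 10 bands, and all function names are choices made here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of character k-grams; a short text is kept as a single shingle."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def candidate_pairs(docs, num_hashes=20, bands=10):
    """Return pairs of document ids that share at least one identical band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the returned candidate pairs need an exact similarity check, which avoids the full O(n²) comparison when most documents are dissimilar. Identical documents always land in the same buckets; near-duplicates do so with high probability.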

LSH Applications
■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting

LSH is a dimensionality reduction technique
■ Batch algorithm
■ The word “the” should not carry the same weight as the word “bozo” when we compare two documents
– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time
– some work on online variants for restricted cases (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)
■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”
■ Well suited to finding very similar objects; not optimal for finding only moderately similar ones
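For weighted vectors compared by cosine distance, a widely used LSH family is random-hyperplane hashing (often called SimHash). The sketch below illustrates that general scheme and is not necessarily the method of the paper linked above; the function names, the seed and the bit count are assumptions.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return [
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in planes
    ]


def hamming(sig_a, sig_b):
    """Number of positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimated_angle(sig_a, sig_b):
    """The fraction of differing bits estimates angle / pi."""
    return math.pi * hamming(sig_a, sig_b) / len(sig_a)
```

Vectors pointing in the same direction get identical signatures regardless of length, so term weights (for example TF-IDF) influence the result only through the direction of the vector.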

Search 4 sense
■ Bayes' theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA

Bayes' theorem
P(A | B) = P(B | A) P(A) / P(B)
where A and B are events and P(B) ≠ 0.
■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
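A small worked example of Bayes' theorem, with P(B) expanded by the law of total probability. The function name and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) = P(B | A) P(A) / P(B).

    P(B) is computed as P(B | A) P(A) + P(B | not A) (1 - P(A)).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b
```

With the assumed values P(A) = 1/100, P(B | A) = 9/10 and P(B | not A) = 1/20, the call `bayes_posterior(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))` returns `Fraction(2, 13)`: observing B raises the probability of A from 1% to about 15%.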

Bayesian vs Frequentist statistics
■ Coin tossing
– the coin landed heads 4 times out of 5
– frequentist estimate: 𝑚/𝑛 = 4/5
– Bayesian estimate with a uniform prior: (𝑚+1)/(𝑛+2) = 5/7
■ Conjugate prior
■ Exponential family
■ Sufficient statistic
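The coin example can be reproduced with a Beta prior, which is conjugate to the binomial likelihood: after 𝑚 heads in 𝑛 tosses, a Beta(α, β) prior becomes Beta(α + 𝑚, β + 𝑛 − 𝑚). The sketch below assumes a uniform Beta(1, 1) prior by default; the function names are illustrative.

```python
from fractions import Fraction


def frequentist_estimate(heads, tosses):
    """Maximum-likelihood estimate of P(heads): m / n."""
    return Fraction(heads, tosses)


def beta_posterior(heads, tosses, alpha=1, beta=1):
    """Parameters of the Beta posterior after observing the tosses."""
    return alpha + heads, beta + (tosses - heads)


def bayesian_estimate(heads, tosses, alpha=1, beta=1):
    """Posterior mean of P(heads); equals (m + 1) / (n + 2) for a uniform prior."""
    a, b = beta_posterior(heads, tosses, alpha, beta)
    return Fraction(a, a + b)
```

For 4 heads in 5 tosses, `frequentist_estimate(4, 5)` is 4/5 and `bayesian_estimate(4, 5)` is 5/7. The pair (𝑚, 𝑛) is a sufficient statistic here: the posterior depends on the data only through these two counts.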

Probabilistic Graphical Models

Topic modeling assumptions
■ Word order within a document does not matter (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as a set of document-word pairs (𝑑, 𝑤)
■ Each topic 𝑡 ∈ 𝑇 can be described by an unknown distribution 𝑝(𝑤|𝑡), 𝑤 ∈ 𝑊
■ Independence assumption: 𝑝(𝑤|𝑡, 𝑑) = 𝑝(𝑤|𝑡)
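Under the independence assumption, the law of total probability gives the word distribution of a document as a mixture over topics: p(w|d) = Σ_t p(w|t) p(t|d). The sketch below evaluates that sum; the dictionary layout, the topic names and the probabilities are made up for illustration.

```python
def word_given_document(word, topics_of_doc, words_of_topic):
    """p(w | d) = sum over topics t of p(w | t) * p(t | d).

    topics_of_doc:  {topic: p(t | d)} for one document d
    words_of_topic: {topic: {word: p(w | t)}}
    """
    return sum(
        p_topic * words_of_topic[topic].get(word, 0.0)
        for topic, p_topic in topics_of_doc.items()
    )


# Hypothetical two-topic model over a three-word vocabulary.
words_of_topic = {
    "sports": {"ball": 0.5, "goal": 0.5, "vote": 0.0},
    "politics": {"ball": 0.0, "goal": 0.25, "vote": 0.75},
}
topics_of_doc = {"sports": 0.5, "politics": 0.5}
```

Here `word_given_document("goal", topics_of_doc, words_of_topic)` is 0.5 · 0.5 + 0.5 · 0.25 = 0.375, and the values over the whole vocabulary sum to 1.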

Probabilistic Latent Semantic Analysis (pLSA)

LDA
■ Almost the same as pLSA, but with a Dirichlet distribution as the prior

Links
Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/
K.Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf
D.Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/
D.Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm
■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

Repository & Chats announcements
■ Github: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine
■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine
