SEARCH4SIMILARS
at scale
What do you mean by similar?
■ Jaccard distance
■ Cosine distance
■ Lots of others
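The two distances named above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are tokenized by whitespace; the function names and the sample strings are hypothetical and not taken from the talk.

```python
import math
from collections import Counter


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(u: Counter, v: Counter) -> float:
    """1 - cos(angle) between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical sample documents, split on whitespace.
doc_a = "cool project in a cool room".split()
doc_b = "cool room in a cold hall".split()

print(jaccard_distance(set(doc_a), set(doc_b)))          # works on sets of words
print(cosine_distance(Counter(doc_a), Counter(doc_b)))   # works on word counts
```

Jaccard ignores how often a word occurs, while cosine distance takes the counts into account, so the two can rank the same pair of documents differently.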
Deduplication / Plagiarism 
LSH
[Figure: pairwise comparison matrix over objects A to G]
All you need is to compare each object
with every other one.
O(n²)
Your cap:
Compare only
similar items.
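One common way to compare only similar items is MinHash signatures with banding, sketched below. This is an illustration under assumptions, not the implementation used in the talk: character 3-gram shingles, a salted BLAKE2b hash as the hash family, and 8 bands of 4 rows are all arbitrary choices, and every function name is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Set of character k-grams of a text (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def stable_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash; `seed` selects one function of the family."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, num_hashes: int) -> tuple:
    """One minimum per hash function; two sets share a minimum with
    probability equal to their Jaccard similarity."""
    return tuple(min(stable_hash(seed, item) for item in items)
                 for seed in range(num_hashes))


def candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Pairs whose signatures agree on every row of at least one band."""
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy corpus: "b" is an exact duplicate of "a".
docs = {
    "a": "locality sensitive hashing finds similar items",
    "b": "locality sensitive hashing finds similar items",
    "c": "topic models describe documents as mixtures of topics",
}
BANDS, ROWS = 8, 4
signatures = {doc_id: minhash_signature(shingles(text), BANDS * ROWS)
              for doc_id, text in docs.items()}
print(candidate_pairs(signatures, BANDS, ROWS))
```

Only documents that land in the same bucket for some band are compared in full, which avoids the O(n²) all-pairs scan. More bands make the filter more permissive, more rows per band make it stricter.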
LSH Applications
■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting
LSH is a dimensionality reduction
technique
■ Batch algorithm
■ The word “the” should not count the same as the word “bozo” when we compare two documents
– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time
– some work on online variants exists for restricted cases
(http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)
■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”
■ Well suited to finding very similar objects; not optimal for finding moderately similar ones.
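For cosine distance, a widely known LSH family is random hyperplane hashing: each bit records on which side of a random hyperplane a vector falls, and two vectors disagree on a bit with probability angle/π. The sketch below illustrates that classic scheme; it is not claimed to be the method of the paper linked above, and the dimensions, bit count, seed, and names are assumptions.

```python
import math
import random


def random_hyperplanes(dim: int, num_bits: int, seed: int = 0) -> list:
    """`num_bits` random normal vectors, each defining one hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector: list, planes: list) -> list:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return [sum(x * r for x, r in zip(vector, plane)) >= 0.0 for plane in planes]


def estimated_angle(sig_u: list, sig_v: list) -> float:
    """P[bits differ] = angle / pi, so angle ≈ pi * fraction of differing bits."""
    differing = sum(a != b for a, b in zip(sig_u, sig_v))
    return math.pi * differing / len(sig_u)


planes = random_hyperplanes(dim=3, num_bits=256, seed=42)
u = [1.0, 2.0, 0.5]
v = [2.0, 4.0, 1.0]      # same direction as u, different length
w = [-1.0, -2.0, -0.5]   # opposite direction to u

print(estimated_angle(signature(u, planes), signature(v, planes)))
print(estimated_angle(signature(u, planes), signature(w, planes)))
```

Because only the direction of a vector matters, term weights (for example, down-weighting “the”) can be applied to the vector before hashing, which plain set-based hashing cannot express.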
Search 4 sense
■ Bayes' theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA
Bayes' theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where A and B are events and P(B) ≠ 0.
■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
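A small worked example may help make the quantities concrete. The numbers below are hypothetical and chosen only for illustration: A is "the document is a near-duplicate", B is "the detector fires". P(B) is obtained from the law of total probability.

```python
from fractions import Fraction


def posterior(prior_a: Fraction, likelihood_b_given_a: Fraction,
              likelihood_b_given_not_a: Fraction) -> Fraction:
    """P(A | B) via Bayes' theorem, with P(B) from the law of total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers, not taken from the talk.
p_a = Fraction(1, 100)               # P(A): 1% of documents are near-duplicates
p_b_given_a = Fraction(95, 100)      # P(B | A): detector fires on a duplicate
p_b_given_not_a = Fraction(5, 100)   # P(B | not A): false alarm rate

print(posterior(p_a, p_b_given_a, p_b_given_not_a))  # 19/118, about 0.16
```

Even with a detector that is right 95% of the time, the low prior P(A) keeps the posterior P(A | B) at roughly 16%.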
Bayesian vs Frequentist statistics
■ Coin tossing
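– the coin landed heads 4 times out of 5
– Bayesian estimate of the probability of heads under a uniform prior, for m heads in n tosses: $\frac{m+1}{n+2}$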
■ Conjugate prior
■ Exponential family
■ Sufficient statistic
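The coin example can be sketched with a Beta prior, which is conjugate to the Bernoulli likelihood: the posterior is again a Beta distribution, and the pair (heads, tosses) is a sufficient statistic for the update. The uniform prior Beta(1, 1) is an assumption of this sketch, as are the function names.

```python
from fractions import Fraction


def frequentist_estimate(heads: int, tosses: int) -> Fraction:
    """Maximum-likelihood estimate of P(heads): m / n."""
    return Fraction(heads, tosses)


def beta_posterior(alpha: int, beta: int, heads: int, tosses: int) -> tuple:
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + m, beta + n - m)."""
    return alpha + heads, beta + (tosses - heads)


def beta_mean(alpha: int, beta: int) -> Fraction:
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


m, n = 4, 5                            # 4 heads in 5 tosses
print(frequentist_estimate(m, n))      # 4/5
a, b = beta_posterior(1, 1, m, n)      # uniform prior Beta(1, 1) -> Beta(5, 2)
print(beta_mean(a, b))                 # 5/7, i.e. (m + 1) / (n + 2)
```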
Probabilistic Graphical Models
Topic modeling
Topic modeling assumptions
■ Order does not matter: neither the order of documents in the collection nor the order of words in a document (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as a set of document-word pairs $(d, w)$
■ Each topic $t \in T$ can be described by an unknown distribution $p(w \mid t)$, $w \in W$
■ Conditional independence assumption: $p(w \mid t, d) = p(w \mid t)$
Probabilistic Latent Semantic Analysis (pLSA)
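Under the assumptions above, pLSA models each document as a mixture of topics, $p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d)$, and fits both distributions with the EM algorithm. The following is a minimal sketch of that procedure on a dense count matrix; the toy counts, iteration count, random initialisation, and all names are assumptions made for illustration.

```python
import random


def normalize(weights: list) -> list:
    """Scale non-negative weights to sum to 1 (uniform if they are all zero)."""
    total = sum(weights)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def plsa(counts: list, num_topics: int, iterations: int = 50, seed: int = 0):
    """counts[d][w] = number of times word w occurs in document d."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # Random positive initialisation of p(w|t) and p(t|d).
    p_w_t = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_t_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]

    for _ in range(iterations):
        new_w_t = [[0.0] * num_words for _ in range(num_topics)]
        new_t_d = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_words):
                if counts[d][w] == 0:
                    continue
                # E-step: p(t|d,w) is proportional to p(w|t) * p(t|d).
                post = normalize([p_w_t[t][w] * p_t_d[d][t]
                                  for t in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for t in range(num_topics):
                    expected = counts[d][w] * post[t]
                    new_w_t[t][w] += expected
                    new_t_d[d][t] += expected
        # M-step: renormalise expected counts into distributions.
        p_w_t = [normalize(row) for row in new_w_t]
        p_t_d = [normalize(row) for row in new_t_d]
    return p_w_t, p_t_d


# Hypothetical toy corpus: documents 0-1 use words 0-2, documents 2-3 use words 3-5.
counts = [[3, 2, 1, 0, 0, 0],
          [2, 3, 1, 0, 0, 0],
          [0, 0, 0, 1, 2, 3],
          [0, 0, 0, 2, 1, 3]]
p_w_t, p_t_d = plsa(counts, num_topics=2)
for row in p_t_d:
    print([round(p, 3) for p in row])   # p(t|d) for each document
```

Documents can then be compared by their topic distributions $p(t \mid d)$ rather than by shared words, which is what lets two texts with different vocabulary still count as similar.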
LDA
■ Almost the same as pLSA,
but with a Dirichlet distribution as the prior
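The role of the Dirichlet prior is easiest to see in the generative process that LDA assumes: topic-word and document-topic distributions are themselves drawn from Dirichlet distributions before any word is generated. The sketch below only samples a synthetic corpus and performs no inference; the symmetric hyperparameters `alpha` and `beta`, the vocabulary, and the function names are assumptions for illustration.

```python
import random


def sample_dirichlet(alphas: list, rng: random.Random) -> list:
    """Dirichlet sample obtained by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs: int, doc_length: int, vocab: list,
                    num_topics: int, alpha: float, beta: float,
                    seed: int = 0) -> list:
    rng = random.Random(seed)
    # Topic-word distributions: phi_t ~ Dirichlet(beta).
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Document-topic distribution: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            t = rng.choices(range(num_topics), weights=theta)[0]   # pick a topic
            words.append(rng.choices(vocab, weights=topics[t])[0])  # pick a word
        corpus.append(words)
    return corpus


# Hypothetical vocabulary and hyperparameters.
vocab = ["cool", "cold", "room", "hall", "project", "bozo"]
corpus = generate_corpus(num_docs=3, doc_length=8, vocab=vocab,
                         num_topics=2, alpha=0.5, beta=0.5, seed=1)
for doc in corpus:
    print(" ".join(doc))
```

Small values of `alpha` and `beta` favour sparse distributions, so each document concentrates on few topics and each topic on few words.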
Links
Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/
K.Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf
D.Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/
D.Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm
■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
Repository & Chats announcements
■ Github: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine
■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine
