SlideShare a Scribd company logo
1 of 33
Download to read offline
Paraphrase Detection in NLP
Yuriy Guts
ML Engineer
=
?
Paraphrasing
Any trip to Italy should include a visit to Tuscany to sample their exquisite wines.
Be sure to include a Tuscan wine-tasting experience when visiting Italy.
Paraphrase Identification
Where can I get very professional and reliable
envelope printing service in Sydney?
Where can I get very affordable branded
envelope printing service in Sydney?
Why are doctors always late?
Why doctors always make you wait for 15-20 minutes
before they see you?
404,290 pairs
Training set
2,345,796 pairs
Test set
Data
Q1 (ID, Title) Q2 (ID, Title)
Target
Binary (metric: log-loss)
Challenges
What are the best books on IT leadership?
What are the best books about leadership?
Sometimes a single word matters
Will the Miami Heat win the NBA championship in 2011?
Will the Miami Heat win the NBA championship in 2012?
How do I lose 20 pounds?
How do I lose 15 pounds?
Sometimes there are almost no shared words
I am unable to talk to girls, leave being friendly with them. Why?
I am shy to talk to any woman because i get nervous and
freaked out around them. What is the solution?
Is there a Quora user who have seen a UFO?
Have you seen an alien?
Sometimes ALL the words are shared
Is the Government of Pakistan encouraging India by not
taking any real action against ceasefire violations?
Is the Government of India encouraging Pakistan by not
taking any real action against ceasefire violations?
What is the most interesting thing we learned about
Portugal's World Cup team in their match against Germany?
What is the most interesting thing we learned about
Germany's World Cup team in their match against Portugal?
Approaches
1. Counting Stuff.
2. Information Theory.
3. Linguistics.
4. Deep Learning.
BoW Representations
Image copyright © S. Mukherjee
Cosine Similarity
Image copyright © C. Perone
Jaccard Similarity
Image copyright © C. Wagner
Edit Distances
1. Levenshtein
2. Damerau-Levenshtein
3. Jaro
4. Jaro-Winkler
… … ...
Lexical Databases and Ontologies
house, home
dwelling
hermitage cottage
backyard
veranda
study
“A place that serves as the living
quarters of one or more families”
. . .
penthouse
Meronym
Hyponym
HyponymHypernym
Hyponym
Def
WordNet
“The complete meaning of a word is always contextual, and no study of
meaning apart from context can be taken seriously”
Distributed Hypothesis of Language
John R. Firth. The technique of semantics.
Transactions of the Philological Society, 1935.
The minimum amount of “work” needed to transform document 1 to document 2.
Inspired by Earth Mover’s Distance, a well-studied transportation problem.
Word Mover’s Distance (WMD)
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015.
http://proceedings.mlr.press/v37/kusnerb15.pdf
WMD: Linear Optimization Problem
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015.
http://proceedings.mlr.press/v37/kusnerb15.pdf
nBOW frequency of the
i-th word in the document
“How much” of word i
travels to word j
Morphology & Syntax Features
https://demos.explosion.ai/displacy/
Architectural Principles
for State-of-the-Art Neural NLP
Embed
Encode
Attend
Predict
State-of-the-Art NLP Pipeline
Image copyright © Explosion AI
Recurrent Highway Networks
Highway layer:
Srivastava et al. “Recurrent Highway Networks”,
2015
https://arxiv.org/pdf/1607.03474.pdf
Two Input Documents? Siamese Network
Run the same encoding step for every input.
Share encoder weights, don’t learn WE1
and WE2
separately.
Distance Learning, Contrastive Loss
Penalize similar pairs by a monotonically increasing function of their learned distance.
Penalize different pairs by a monotonically decreasing function of their learned distance.
Hadsell, Chopra, LeCun “Dimensionality Reduction by Learning an Invariant Mapping”, 2006.
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
404,290 pairs
Training set
2,345,796 pairs
Test set
Data
Q1 (ID, Title) Q2 (ID, Title)
Target
Binary (metric: log-loss)
How can I prevent pneumonia?
How do you prevent pneumonia?
Sometimes the labeling is just plain wrong
What is the car in this picture?
What car is in this picture?
How do I protect single phasing of a 3 phase motor?
What is single phase and 3 phase?
People misspell. A lot.
demonitization demonitozation masturbution masturbtion
demonitizing demonetiztion masturation mastburation
demonetising demoneitisation mastrubating masturabate
demonitisation demoneticzation masturbuation mastrubration
demonitize demonitizition masterbuting mastubrate
demonatisation demonetisations mastubate masturbute
demonitazation demonitasation mastribution mastribute
demonitesation demonestisation masturburation masterbuate
demoneytisation demotenization mastrabutation mastubating
demonestation demonitised mastubation
demonitising demonsation
100,000+
spelling corrections
Different target balance for Train and Test
Final Solution (Simplified)
Tokenized Text
Sequences of
Pre-trained
Embeddings
Statistical Features
Fuzzy Distances
Linguistic Features
Topic Features
Leaks ¯_(ツ)_/¯
Deep NN Predictions
Embedding Features
Final Model
(GBM)
Prediction
Calibration
~200 Features
for the final model
facebook.com/yuriy.guts
github.com/YuriyGuts/kaggle-quora-question-pairs

More Related Content

What's hot

Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGISynaptonIncorporated
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...David Talby
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General AudiencesSangwoo Mo
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMDivya Gera
 

What's hot (20)

Word2Vec
Word2VecWord2Vec
Word2Vec
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Text Classification
Text ClassificationText Classification
Text Classification
 
BERT
BERTBERT
BERT
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
OpenAI Chatgpt.pptx
OpenAI Chatgpt.pptxOpenAI Chatgpt.pptx
OpenAI Chatgpt.pptx
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 

Similar to Paraphrase Detection in NLP

Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsForward Gradient
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialAlyona Medelyan
 
KiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonKiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonAlyona Medelyan
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
download
downloaddownload
downloadbutest
 
download
downloaddownload
downloadbutest
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 
Why use video by david mann
Why use video by david mannWhy use video by david mann
Why use video by david mannDavid Mann
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games ResearchJose Zagal
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptxHaibinSu2
 
Deep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDeep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDebdoot Sheet
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowValeria de Paiva
 
The expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsThe expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsLawrie Hunter
 
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...Peter Doolittle
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...Amit Sheth
 

Similar to Paraphrase Detection in NLP (20)

Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorial
 
KiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonKiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with Python
 
Nltk
NltkNltk
Nltk
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
download
downloaddownload
download
 
download
downloaddownload
download
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
Why use video by david mann
Why use video by david mannWhy use video by david mann
Why use video by david mann
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptx
 
Deep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDeep Learning - What's the buzz all about
Deep Learning - What's the buzz all about
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
 
The expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsThe expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigms
 
Bird05 nltk-intro
Bird05 nltk-introBird05 nltk-intro
Bird05 nltk-intro
 
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
 

More from Yuriy Guts

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Yuriy Guts
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine LearningYuriy Guts
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)Yuriy Guts
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivYuriy Guts
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of RedisYuriy Guts
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in ScalaYuriy Guts
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET DevelopersYuriy Guts
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETYuriy Guts
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsYuriy Guts
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software ArchitectureYuriy Guts
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business AnalystsYuriy Guts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceYuriy Guts
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseYuriy Guts
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesYuriy Guts
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingYuriy Guts
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryYuriy Guts
 

More from Yuriy Guts (19)

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine Learning
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of Redis
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software Architecture
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business Analysts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT Audience
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 

Paraphrase Detection in NLP

  • 1. Paraphrase Detection in NLP Yuriy Guts ML Engineer = ?
  • 2. Paraphrasing Any trip to Italy should include a visit to Tuscany to sample their exquisite wines. Be sure to include a Tuscan wine-tasting experience when visiting Italy.
  • 3. Paraphrase Identification Where can I get very professional and reliable envelope printing service in Sydney? Where can I get very affordable branded envelope printing service in Sydney? Why are doctors always late? Why doctors always make you wait for 15-20 minutes before they see you?
  • 4. 404,290 pairs Training set 2,345,796 pairs Test set Data Q1 (ID, Title) Q2 (ID, Title) Target Binary (metric: log-loss)
  • 6. What are the best books on IT leadership? What are the best books about leadership? Sometimes a single word matters Will the Miami Heat win the NBA championship in 2011? Will the Miami Heat win the NBA championship in 2012? How do I lose 20 pounds? How do I lose 15 pounds?
  • 7. Sometimes there are almost no shared words I am unable to talk to girls, leave being friendly with them. Why? I am shy to talk to any woman because i get nervous and freaked out around them. What is the solution? Is there a Quora user who have seen a UFO? Have you seen an alien?
  • 8. Sometimes ALL the words are shared Is the Government of Pakistan encouraging India by not taking any real action against ceasefire violations? Is the Government of India encouraging Pakistan by not taking any real action against ceasefire violations? What is the most interesting thing we learned about Portugal's World Cup team in their match against Germany? What is the most interesting thing we learned about Germany's World Cup team in their match against Portugal?
  • 9. Approaches 1. Counting Stuff. 2. Information Theory. 3. Linguistics. 4. Deep Learning.
  • 13. Edit Distances 1. Levenshtein 2. Damerau-Levenshtein 3. Jaro 4. Jaro-Winkler … … ...
  • 14. Lexical Databases and Ontologies house, home dwelling hermitage cottage backyard veranda study “A place that serves as the living quarters of one or more families” . . . penthouse Meronym Hyponym HyponymHypernym Hyponym Def
  • 16. “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously” Distributed Hypothesis of Language John R. Firth. The technique of semantics. Transactions of the Philological Society, 1935.
  • 17. The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied transportation problem. Word Mover’s Distance (WMD) M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf
  • 18. WMD: Linear Optimization Problem M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf nBOW frequency of the i-th word in the document “How much” of word i travels to word j
  • 19. Morphology & Syntax Features https://demos.explosion.ai/displacy/
  • 22. State-of-the-Art NLP Pipeline Image copyright © Explosion AI
  • 23. Recurrent Highway Networks Highway layer: Srivastava et al. “Recurrent Highway Networks”, 2015 https://arxiv.org/pdf/1607.03474.pdf
  • 24. Two Input Documents? Siamese Network Run the same encoding step for every input. Share encoder weights, don’t learn WE1 and WE2 separately.
  • 25. Distance Learning, Contrastive Loss Penalize similar pairs by a monotonically increasing function of their learned distance. Penalize different pairs by a monotonically decreasing function of their learned distance. Hadsell, Chopra, LeCun “Dimensionality Reduction by Learning an Invariant Mapping”, 2006. http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
  • 26. 404,290 pairs Training set 2,345,796 pairs Test set Data Q1 (ID, Title) Q2 (ID, Title) Target Binary (metric: log-loss)
  • 27. How can I prevent pneumonia? How do you prevent pneumonia? Sometimes the labeling is just plain wrong What is the car in this picture? What car is in this picture? How do I protect single phasing of a 3 phase motor? What is single phase and 3 phase?
  • 28. People misspell. A lot. demonitization demonitozation masturbution masturbtion demonitizing demonetiztion masturation mastburation demonetising demoneitisation mastrubating masturabate demonitisation demoneticzation masturbuation mastrubration demonitize demonitizition masterbuting mastubrate demonatisation demonetisations mastubate masturbute demonitazation demonitasation mastribution mastribute demonitesation demonestisation masturburation masterbuate demoneytisation demotenization mastrabutation mastubating demonestation demonitised mastubation demonitising demonsation 100,000+ spelling corrections
  • 29. Different target balance for Train and Test
  • 30. Final Solution (Simplified) Tokenized Text Sequences of Pre-trained Embeddings Statistical Features Fuzzy Distances Linguistic Features Topic Features Leaks ¯_(ツ)_/¯ Deep NN Predictions Embedding Features Final Model (GBM) Prediction Calibration ~200 Features for the final model
  • 31.
  • 32.