STSqgrams STSnpmi
STScsds
SOFTCARDINALITY-CORE: Improving Text Overlap with
Distributional Measures for Semantic Textual Similarity
Sergio Jimenez Claudia Becerra Alexander Gelbukh
Abstract
Soft cardinality has been shown to be a very strong
text-overlapping baseline for the task of measuring semantic textual
similarity (STS), obtaining 3rd place in SemEval-2012. At the
*SEM 2013 shared task, besides the plain text-overlapping approach, we
tested within soft cardinality two distributional word-similarity
functions derived from the ukWaC corpus. Unfortunately, we
combined these measures with other features using
regression, obtaining positions 18th, 22nd and 23rd among the 90
participating systems in the official ranking. After the
release of the gold standard annotations of the test data, we
observed that using only the similarity measures, without
combining them with other features, would have obtained
positions 6th, 7th and 8th; moreover, an arithmetic average of
these similarity measures would have ranked 4th (mean=0.5747).
This paper describes both the 3 systems as they were submitted
and the similarity measures that would have obtained those better
results.
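Soft Cardinality
Cardinality: the number of different elements in a
collection, i.e. the set definition.
[Figure: collections A and B of three elements each, with |A|=3 and |B|=3. A collection C of two identical elements has classical (integer) cardinality |C|=1 and soft (real) cardinality |C|’=1.0.]
[Formula: soft cardinality, annotated with: inter-element similarity, element weights, “softness” control, word-to-word similarity, idf term weighting.]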
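The formula itself did not survive in this transcript. As a minimal sketch, assuming the commonly published form of soft cardinality, |A|’ = Σ_i w_i / Σ_j sim(a_i, a_j)^p, the idea can be illustrated as below. The function names, the default unit weights and the exact-match similarity are illustrative assumptions, not taken from the poster; p plays the role of the “softness” control.

```python
def soft_cardinality(elements, sim, weights=None, p=1.0):
    """Soft cardinality of a collection (assumed form, see the text above).

    elements: list of items (e.g. words)
    sim:      similarity function with sim(x, x) == 1 and values in [0, 1]
    weights:  optional per-element weights (e.g. idf); defaults to 1.0 each
    p:        "softness" control exponent
    """
    if weights is None:
        weights = [1.0] * len(elements)
    total = 0.0
    for a, w in zip(elements, weights):
        # Each element is discounted by how similar it is to all elements,
        # itself included, so the denominator is at least 1.
        denom = sum(sim(a, b) ** p for b in elements)
        total += w / denom
    return total


def exact_match(x, y):
    """Crisp similarity: soft cardinality then equals classical cardinality."""
    return 1.0 if x == y else 0.0


print(soft_cardinality(["x", "y", "z"], exact_match))  # 3.0
print(soft_cardinality(["x", "x"], exact_match))       # 1.0
```

With a graded similarity function in place of `exact_match`, similar but non-identical elements contribute less than one full unit each, which is what makes the cardinality real-valued.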
Symmetrical Tversky’s Ratio Model
[Formulas: the original Tversky’s ratio model for collections A and B with parameters α and β, and its symmetrical version.]
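The formulas are not recoverable from this transcript. The sketch below uses Tversky's original ratio model in its standard form, |A∩B| / (|A∩B| + α|A−B| + β|B−A|), and one possible symmetrization that takes the min and max of the two set differences. The symmetrized form, its parameter roles and the `bias` term are assumptions to be checked against the paper. Both functions work from |A|, |B| and |A∪B| only, so soft cardinalities can be plugged in.

```python
def _parts(card_a, card_b, card_union):
    """Derive |A∩B|, |A−B| and |B−A| from |A|, |B| and |A∪B|."""
    inter = card_a + card_b - card_union
    a_only = card_union - card_b
    b_only = card_union - card_a
    return inter, a_only, b_only


def tversky(card_a, card_b, card_union, alpha=0.5, beta=0.5):
    """Original Tversky ratio model (asymmetric when alpha != beta)."""
    inter, a_only, b_only = _parts(card_a, card_b, card_union)
    denom = inter + alpha * a_only + beta * b_only
    return inter / denom if denom > 0 else 0.0


def symmetric_tversky(card_a, card_b, card_union,
                      alpha=0.5, beta=1.0, bias=0.0):
    """Assumed symmetrization: swapping A and B cannot change the result."""
    inter, a_only, b_only = _parts(card_a, card_b, card_union)
    lo, hi = min(a_only, b_only), max(a_only, b_only)
    c = inter + bias
    denom = beta * (alpha * lo + (1.0 - alpha) * hi) + c
    return c / denom if denom > 0 else 0.0


# A = {a, b, c}, B = {b, c, d}: |A| = 3, |B| = 3, |A∪B| = 4
print(tversky(3, 3, 4, alpha=1.0, beta=1.0))  # 0.5, the Jaccard coefficient
```

Setting alpha = beta = 1 in the original model gives Jaccard and alpha = beta = 0.5 gives Dice, which is why the ratio model is a convenient generalization of both.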
Word-to-word similarity functions:
• Character q-grams (strm reused with a range of q-grams)
• PMI (normalized in [0,1])
• Context-set distributional sim. (strm reused with sentence frequencies)
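As a rough illustration of the first two word-similarity functions, the sketch below computes a Dice overlap on character q-grams over a range of q, and a normalized PMI clipped into [0,1]. The padding character, the default q range, the Dice coefficient and the clipping of negative values are all assumptions made for this example; the poster does not state how the measures were actually normalized.

```python
import math


def qgrams(word, q):
    """Set of character q-grams of a word, padded with '#' (assumed padding)."""
    padded = "#" * (q - 1) + word + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(w1, w2, q_min=2, q_max=3):
    """Dice overlap of the q-gram sets pooled over q_min..q_max."""
    g1, g2 = set(), set()
    for q in range(q_min, q_max + 1):
        g1 |= qgrams(w1, q)
        g2 |= qgrams(w2, q)
    total = len(g1) + len(g2)
    if total == 0:
        return 1.0
    return 2.0 * len(g1 & g2) / total


def npmi_unit(count_x, count_y, count_xy, n):
    """Normalized PMI from co-occurrence counts, clipped to [0, 1]."""
    if count_xy == 0:
        return 0.0
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    if p_xy == 1.0:
        return 1.0
    npmi = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    # Negative association is treated as zero similarity (assumption).
    return min(1.0, max(0.0, npmi))


print(qgram_similarity("house", "house"))  # 1.0
print(qgram_similarity("abc", "xyz"))      # 0.0
```

Either function can be passed as the word-to-word similarity inside a soft cardinality computation, provided it returns 1 for identical words.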
Unofficial Results
Pre-processing
• Tokenization
• Lowercasing
• Stop-word removal
Distributional measures trained on the ukWaC corpus
You can build a simple semantic textual
similarity (STS) function by …
1. Representing sentences as unordered
collections of words (bags of words).
2. Counting the different elements in those
collections and their unions using classic
set cardinality (but you had better use our
soft cardinality).
3. Comparing collection pairs using any
cardinality-based resemblance coefficient,
such as the Jaccard or Dice coefficient (but
you had better use our Symmetrical Tversky’s
ratio model).
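The three steps above can be sketched in Python with classic set cardinality and the Jaccard or Dice coefficient. This is only the plain baseline the recipe starts from; the function names and the whitespace tokenization are assumptions made for the example.

```python
def bag_of_words(sentence):
    """Step 1: an unordered collection of lowercased, whitespace-split words."""
    return sentence.lower().split()


def sts(s1, s2, coefficient="jaccard"):
    """Steps 2 and 3: classic cardinalities, then a resemblance coefficient."""
    a, b = set(bag_of_words(s1)), set(bag_of_words(s2))
    card_a, card_b, card_union = len(a), len(b), len(a | b)
    card_inter = card_a + card_b - card_union
    if card_union == 0:
        return 0.0
    if coefficient == "jaccard":
        return card_inter / card_union
    if coefficient == "dice":
        return 2.0 * card_inter / (card_a + card_b)
    raise ValueError("unknown coefficient: " + coefficient)


print(sts("A cat sat", "a cat ran"))  # 0.5
```

Because the coefficient only needs |A|, |B| and |A∪B|, replacing `len` with a soft cardinality function changes the cardinalities without touching the comparison step.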
• Punctuation removal
• Stemming (Porter)
• IDF term weighting
Word and character q-grams overlap
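A minimal sketch of two of these pre-processing steps, punctuation removal and IDF term weighting, using only the Python standard library. Porter stemming is left out because it would need an external stemmer. The lowercasing step, the function names and the plain log(N/df) form of IDF are assumptions for this example.

```python
import math
import string


def preprocess(sentence):
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table).lower().split()


def idf_weights(tokenized_sentences):
    """IDF per term as log(N / df), with df counted once per sentence."""
    n = len(tokenized_sentences)
    df = {}
    for tokens in tokenized_sentences:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


corpus = [preprocess("A cat sat."), preprocess("A dog ran!")]
weights = idf_weights(corpus)
print(weights["a"])  # 0.0, since "a" occurs in every sentence
```

The resulting weights can serve as the element weights in a weighted cardinality, so that terms occurring in every sentence contribute nothing.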
