An initial Analysis of
Topic-based Similarity
among Scientific Documents
based on their
Rhetorical Discourse Parts
ocorcho@fi.upm.es
@ocorcho ISWC’17
oeg-upm.net
Carlos Badenes-Olmedo
Jose Luis Redondo-Garcia
Oscar Corcho
Ontology Engineering Group
Universidad Politécnica de Madrid
Spain
Motivation
How representative is an abstract?
Scientific Research
Practitioners
Reviewers
Motivation
How representative are summaries based
on scientific discourse categories?
Scientific Research
Practitioners
Reviewers
approach
challenge
background
outcome
future work
Representativeness
Full-Paper
Summary
Internal
External
finding related items
describing main ideas
Probabilistic Topic Models
• Each document is a mixture of corpus-wide topics
• Each topic is a distribution over words
• Each word is drawn from one of those topics
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research,
Latent Dirichlet Allocation (LDA)
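The three statements above describe LDA's generative story. A minimal sketch of that story follows, using only the Python standard library. The vocabulary, the two topic-word distributions and the Dirichlet parameter `alpha` are made-up illustrative values, not taken from the study, and the sketch covers generation only, not the training or inference performed in the evaluation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probabilities, rng):
    """Draw an index according to a discrete probability distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # The document is a mixture of the corpus-wide topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Each word is drawn from one of those topics ...
        topic = sample_index(theta, rng)
        # ... and each topic is a distribution over the vocabulary.
        word = sample_index(topics[topic], rng)
        words.append(vocabulary[word])
    return theta, words


# Hypothetical toy vocabulary and topics (illustrative only).
vocabulary = ["orbit", "satellite", "ontology", "semantic"]
topics = [
    [0.45, 0.45, 0.05, 0.05],  # a "space" topic
    [0.05, 0.05, 0.45, 0.45],  # a "semantic web" topic
]

rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, alpha=[0.5, 0.5], length=20, rng=rng)
print(theta)
print(words)
```

The vector `theta` is the kind of topic distribution that the representativeness measures compare.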
Representativeness Measure
Internal: JSD-based similarity between the topic distribution of the Full-Paper [d1,d2,d3,..dn] and that of the Summary [s1,s2,s3,..sn]
External: precision / recall / f-measure between the related documents found by JSD-based similarity for the Full-Paper and those found for the Summary
• Feature vectors in Topic Models are topic distributions expressed as vectors of probabilities
• The similarity measure used in our analysis is based on the Jensen-Shannon Divergence (JSD)
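As a rough illustration of a JSD-based similarity between two topic distributions, the sketch below computes the Jensen-Shannon divergence with base-2 logarithms, which bounds it between 0 and 1. The conversion `similarity = 1 - JSD` is an assumption made here for illustration; the slide text is cut off before it states the exact formula. The three topic vectors are invented values.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions of equal length."""
    # The midpoint distribution is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_similarity(p, q):
    """Similarity in [0, 1]. Assumption: similarity = 1 - JSD (base-2 logarithm)."""
    return 1.0 - jensen_shannon_divergence(p, q)


# Hypothetical topic distributions over three topics (illustrative only).
full_paper = [0.60, 0.30, 0.10]
abstract = [0.20, 0.30, 0.50]
approach = [0.55, 0.35, 0.10]

print(jsd_similarity(full_paper, abstract))
print(jsd_similarity(full_paper, approach))
```

With these made-up vectors the approach summary scores closer to the full paper than the abstract does, simply because its vector was chosen to be nearer.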
Evaluation
Journals: Advances in Space Research, Procedia Chemistry, Journal of Pharmaceutical Analysis, Journal of Web Semantics
Elsevier API: 1000 papers (+ abstracts)
Discover rhetorical parts: 1000 papers (+ abstracts, + discourse parts)
Topic Model: training (only full-papers), then inference
Output: network of related papers (+ abstracts + discourse parts)
Evaluation
• Advances in Space Research Corpus:
http://librairy.linkeddata.es/resources/domains/aisr
• Procedia Chemistry Corpus:
http://librairy.linkeddata.es/resources/domains/pc
• Journal of Pharmaceutical Analysis Corpus:
http://librairy.linkeddata.es/resources/domains/jopa
• Journal of Web Semantics Corpus:
http://librairy.linkeddata.es/resources/domains/jows
• Test Corpus:
http://librairy.linkeddata.es/resources/domains/group1
Explore a Corpus
• Topics in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/topics?words=10
• Papers in a Corpus:
http://librairy.linkeddata.es/resources/domains/group1/items?size=10
Evaluation
Full-Paper
• Info:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106?content=true
• Parts:
http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106/parts
• abstract:
http://librairy.linkeddata.es/resources/parts/adfe85d9634654e4cfd7148be7cd2b29?content=true
• approach:
http://librairy.linkeddata.es/resources/parts/83f2b9722953034d7b6b50cbead4ec6b?content=true
• outcome:
http://librairy.linkeddata.es/resources/parts/61452a5ec420c8926160ae748c12a826?content=true
• challenge:
http://librairy.linkeddata.es/resources/parts/8858ef323fc09efbdcd46b9de45f146c?content=true
• background:
http://librairy.linkeddata.es/resources/parts/d118ef60d5e874d69d92c6b07be68b61?content=true
• future-work:
http://librairy.linkeddata.es/resources/parts/92be5400df5bb331e5f7f692e6b05bca?content=true
• Topic Distribution of Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/topics?words=15
• Topic Distribution of Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/topics?words=15
• Similarity between Full-Paper and Abstract:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=adfe85d9634654e4cfd7148be7cd2b29
• Similarity between Full-Paper and Approach content:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=83f2b9722953034d7b6b50cbead4ec6b
Internal Representativeness
Evaluation
• Similar papers to Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=item&size=5
• Similar papers to Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=item&size=5
• Similar papers to Approach content:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=item&size=5
• Similar summaries to a Full-Paper:
http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=part&size=5
• Similar summaries to an Abstract:
http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=part&size=5
• Similar summaries to Approach:
http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=part&size=5
External Representativeness
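The similarity links on these slides appear to follow one URL pattern: a domain, a collection (`items` for full papers, `parts` for summaries), a resource id and a query string. The sketch below only builds such URL strings. The helper name `relations_url` is hypothetical, the pattern is inferred from the example links rather than from API documentation, and no request is sent because the availability and response format of the service are not confirmed here.

```python
from urllib.parse import urlencode

BASE_URL = "http://librairy.linkeddata.es/resources"


def relations_url(domain, collection, resource_id, **params):
    """Build a similarity-relations URL following the pattern of the slide examples.

    collection is "items" for full papers or "parts" for summaries.
    Extra keyword arguments (e.g. resourceType, size, relatedId) become
    query parameters after type=similarity, in the order given.
    """
    query = urlencode({"type": "similarity", **params})
    return f"{BASE_URL}/domains/{domain}/{collection}/{resource_id}/relations?{query}"


# Papers similar to the full paper used as the running example.
papers_like_full_paper = relations_url(
    "group1", "items", "2-s2.0-84924147106", resourceType="item", size=5
)

# Summaries similar to the abstract of that paper.
summaries_like_abstract = relations_url(
    "group1", "parts", "adfe85d9634654e4cfd7148be7cd2b29", resourceType="part", size=5
)

print(papers_like_full_paper)
print(summaries_like_abstract)
```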
Results: Size of Summaries
The approach, the background and the outcome content of a paper generate more accurate topic distributions than those created from other summaries, such as the abstract.
Since LDA treats documents as bags of words, text length affects the accuracy of the topic distributions inferred by the model.
Relative size of summaries with respect to the full paper
Absolute size of summaries (in number of characters)
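To illustrate why text length matters under a bag-of-words view, the sketch below builds word-count representations with a naive whitespace tokenizer. The tokenizer and the two example texts are assumptions for illustration only, not the preprocessing used in the study. Word order is discarded, so the counts are the only evidence available to the model, and a short summary supplies far fewer of them.

```python
from collections import Counter


def bag_of_words(text):
    """Count token occurrences, discarding word order (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


# Hypothetical texts (illustrative only).
full_text = (
    "topic models describe documents as mixtures of topics "
    "and topics as distributions over words"
)
short_summary = "topic models describe documents"

# Reordering the tokens yields exactly the same representation.
same_bag = bag_of_words("documents describe topic models") == bag_of_words(short_summary)

full_bag = bag_of_words(full_text)
summary_bag = bag_of_words(short_summary)

# The number of token observations available for estimating a topic distribution.
print(same_bag)
print(sum(full_bag.values()), sum(summary_bag.values()))
```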
Results: Internal Representativeness
• The Internal Representativeness of a summary measures the similarity of this summary to the original full-text research paper
• This similarity is based on the JSD between their topic distributions
• Results suggest that the topic distribution of the text created from the approach content is the most similar to that of the full content of the paper
internal-representativeness
Results: External Representativeness
• The External Representativeness of a summary measures how much the set of related documents obtained from the summary differs from the set derived from the original text
• Similarity thresholds from 0.5 to 0.99 were considered in the experiments
precision recall
• In terms of recall, the upward trend followed by the approach, the outcome and the background content supports the assumption that summaries containing key words allow more similar papers to be discovered than other summaries
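The comparison of related-document sets can be sketched as follows. The documents related to the full paper serve as the reference set, and the documents related to the summary are scored against it with precision, recall and f-measure at a chosen similarity threshold. The similarity scores below are invented, the function names are hypothetical, and treating a document as related when its score is at or above the threshold is an assumption.

```python
def related_documents(similarities, threshold):
    """Return the ids of documents whose similarity is at or above the threshold."""
    return {doc_id for doc_id, score in similarities.items() if score >= threshold}


def precision_recall_f1(reference, retrieved):
    """Score a retrieved set of documents against a reference set."""
    true_positives = len(reference & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical similarity scores of other papers to the full paper and to one summary.
full_paper_similarities = {"p1": 0.95, "p2": 0.80, "p3": 0.60, "p4": 0.40}
summary_similarities = {"p1": 0.90, "p2": 0.55, "p3": 0.75, "p5": 0.85}

threshold = 0.7
reference = related_documents(full_paper_similarities, threshold)
retrieved = related_documents(summary_similarities, threshold)
precision, recall, f1 = precision_recall_f1(reference, retrieved)
print(precision, recall, f1)
```

Repeating the computation over a range of thresholds gives one precision, recall and f-measure value per threshold, which is the shape of the curves reported in the results.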
Results: External Representativeness
f-measure
• For higher similarity thresholds, i.e. for strongly related papers, the
recommendations discovered by using the approach are more precise
than those discovered by using the abstract.
Conclusions
• We have studied the topic-based similarities among scientific documents based on their abstract sections, compared with summaries corresponding to their scientific discourse categories.
• Two novel measures have been proposed: (1) internal-representativeness and (2) external-representativeness.
• Results show that summaries created from the approach, outcome or background content of a paper describe its full content more accurately than abstracts, in terms of overall ideas and related documents.
• To avoid the influence of summary size on the accuracy of the results, in future work we plan to use probabilistic topic model algorithms designed for short texts, such as BTM.