An initial analysis of topic-based similarity among scientific documents based on their rhetorical discourse parts

Presentation given at the SemSci2017 workshop (https://semsci.github.io/semSci2017/), for the paper "An Initial Analysis of Topic-based Similarity among Scientific Documents Based on their Rhetorical Discourse Parts" http://ceur-ws.org/Vol-1931/paper-03.pdf

  1. An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts. Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia, Oscar Corcho. Ontology Engineering Group, Universidad Politécnica de Madrid, Spain. ocorcho@fi.upm.es, @ocorcho, ISWC’17, oeg-upm.net
  2. Motivation: How representative is an abstract? (Scientific Research; Practitioners; Reviewers)
  3. Motivation: How representative are summaries based on scientific discourse categories (approach, challenge, background, outcome, future work)? (Scientific Research; Practitioners; Reviewers)
  4. Representativeness of a Summary with respect to the Full-Paper
     • Internal: describing main ideas
     • External: finding related items
  5. Probabilistic Topic Models: Latent Dirichlet Allocation (LDA)
     • Each document is a mixture of corpus-wide topics
     • Each topic is a distribution over words
     • Each word is drawn from one of those topics
     Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research.
  6. Representativeness Measure
     • Internal: JSD-based similarity between the topic distribution of the Full-Paper, [d1, d2, d3, ..., dn], and that of the Summary, [s1, s2, s3, ..., sn]
     • External: precision / recall / f-measure over the related documents found with JSD-based similarity from the Full-Paper and from the Summary
     • Feature vectors in Topic Models are topic distributions expressed as vectors of probabilities
     • The similarity measure used in our analysis is based on the Jensen-Shannon Divergence (JSD)
  7. Evaluation
     • Sources: Advances in Space Research, Procedia Chemistry, Journal of Pharmaceutical Analysis, Journal of Web Semantics
     • Elsevier API: 1000 papers (+ abstracts)
     • Discover rhetorical parts: 1000 papers (+ abstracts, + discourse parts)
     • Topic Model: training (only full-papers), then inference
     • Output: network of related papers (+ abstracts + discourse parts)
  8. Evaluation
     • Advances in Space Research Corpus: http://librairy.linkeddata.es/resources/domains/aisr
     • Procedia Chemistry Corpus: http://librairy.linkeddata.es/resources/domains/pc
     • Journal of Pharmaceutical Analysis Corpus: http://librairy.linkeddata.es/resources/domains/jopa
     • Journal of Web Semantics Corpus: http://librairy.linkeddata.es/resources/domains/jows
     • Test Corpus: http://librairy.linkeddata.es/resources/domains/group1
     Explore a Corpus:
     • Topics in a Corpus: http://librairy.linkeddata.es/resources/domains/group1/topics?words=10
     • Papers in a Corpus: http://librairy.linkeddata.es/resources/domains/group1/items?size=10
  9. Evaluation: Internal Representativeness
     Full-Paper:
     • Info: http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106?content=true
     • Parts: http://librairy.linkeddata.es/resources/items/2-s2.0-84924147106/parts
     • abstract: http://librairy.linkeddata.es/resources/parts/adfe85d9634654e4cfd7148be7cd2b29?content=true
     • approach: http://librairy.linkeddata.es/resources/parts/83f2b9722953034d7b6b50cbead4ec6b?content=true
     • outcome: http://librairy.linkeddata.es/resources/parts/61452a5ec420c8926160ae748c12a826?content=true
     • challenge: http://librairy.linkeddata.es/resources/parts/8858ef323fc09efbdcd46b9de45f146c?content=true
     • background: http://librairy.linkeddata.es/resources/parts/d118ef60d5e874d69d92c6b07be68b61?content=true
     • future-work: http://librairy.linkeddata.es/resources/parts/92be5400df5bb331e5f7f692e6b05bca?content=true
     Topic distributions and similarities:
     • Topic Distribution of Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/topics?words=15
     • Topic Distribution of Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/topics?words=15
     • Similarity between Full-Paper and Abstract: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=adfe85d9634654e4cfd7148be7cd2b29
     • Similarity between Full-Paper and Approach content: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&relatedId=83f2b9722953034d7b6b50cbead4ec6b
  10. Evaluation: External Representativeness
      • Similar papers to Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=item&size=5
      • Similar papers to Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=item&size=5
      • Similar papers to Approach content: http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=item&size=5
      • Similar summaries to a Full-Paper: http://librairy.linkeddata.es/resources/domains/group1/items/2-s2.0-84924147106/relations?type=similarity&resourceType=part&size=5
      • Similar summaries to an Abstract: http://librairy.linkeddata.es/resources/domains/group1/parts/adfe85d9634654e4cfd7148be7cd2b29/relations?type=similarity&resourceType=part&size=5
      • Similar summaries to Approach: http://librairy.linkeddata.es/resources/domains/group1/parts/83f2b9722953034d7b6b50cbead4ec6b/relations?type=similarity&resourceType=part&size=5
  11. Results: Size of Summaries
      • The approach, the background and the outcome content of a paper generate more accurate topic distributions than those created from other summaries such as the abstract.
      • Since LDA treats documents as bags of words, the text length affects the accuracy of the topic distributions inferred by the model.
      • Figures: relative size of summaries with respect to the full paper; absolute size of summaries (in number of characters).
  12. Results: Internal Representativeness
      • The Internal Representativeness of a summary measures the similarity of this summary to the original full-text research paper.
      • This similarity is based on the JSD between the topic distributions of the two texts.
      • Results suggest that the distribution of topics describing the text created from the approach content is the most similar to the one corresponding to the full content of the paper.
      • Figure: internal-representativeness.
  13. Results: External Representativeness
      • The External Representativeness of a summary measures how different the set of related documents obtained from the summary is from the set derived from the original text.
      • Similarity thresholds from 0.5 to 0.99 were considered in the experiments.
      • In terms of recall, the upward trend followed by the approach, the outcome and the background content supports the assumption that summaries containing key words allow more similar papers to be discovered than other summaries do.
      • Figures: precision; recall.
  14. Results: External Representativeness
      • For higher similarity thresholds, i.e. for strongly related papers, the recommendations discovered by using the approach are more precise than those discovered by using the abstract.
      • Figure: f-measure.
  15. Conclusions
      • We have studied the topic-based similarities among scientific documents based on their abstract sections with respect to summaries corresponding to their scientific discourse categories.
      • Two novel measures have been proposed: (1) internal-representativeness and (2) external-representativeness.
      • Results show that summaries created from the approach, outcome or background content of a paper describe its full content more accurately than abstracts do, in terms of overall ideas and related documents.
      • To avoid the size of the summaries influencing the accuracy of the results, in future work we plan to describe texts with probabilistic topic model algorithms designed for short texts, such as BTM.
  16. Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia, Oscar Corcho. Ontology Engineering Group, Universidad Politécnica de Madrid, Spain. ocorcho@fi.upm.es, @ocorcho, ISWC’17, oeg-upm.net
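
Slides 6 and 12 describe internal representativeness as a JSD-based similarity between the topic distribution of a summary and that of the full paper. The slides do not give the exact formula, so the following is only a minimal sketch under stated assumptions: the Jensen-Shannon divergence is computed with base-2 logarithms (so it lies between 0 and 1), and the similarity is taken as `1 - JSD`. The function names and the example topic vectors are hypothetical and are not taken from the presentation or from the librairy API.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    # m is the pointwise average; it is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_similarity(p, q):
    """ASSUMPTION: similarity defined as 1 - JSD (base 2)."""
    return 1.0 - jensen_shannon_divergence(p, q)


def internal_representativeness(full_paper_topics, summary_topics):
    """Similarity of a summary's topic distribution to the full paper's."""
    return jsd_similarity(full_paper_topics, summary_topics)


# Made-up topic distributions over three topics, for illustration only.
full_paper = [0.70, 0.20, 0.10]
abstract = [0.40, 0.40, 0.20]
approach = [0.65, 0.25, 0.10]

abstract_score = internal_representativeness(full_paper, abstract)
approach_score = internal_representativeness(full_paper, approach)
print(f"abstract: {abstract_score:.3f}, approach: {approach_score:.3f}")
```

With these invented vectors the approach distribution is closer to the full paper than the abstract is, which mirrors the kind of comparison reported on slide 12 but says nothing about the actual results.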
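
Slides 13 and 14 evaluate external representativeness with precision, recall and f-measure over the papers related to a summary versus those related to the full paper, at similarity thresholds from 0.5 to 0.99. The sketch below illustrates one way such a computation could look. It assumes that a paper counts as related when its similarity is greater than or equal to the threshold, and that the set derived from the full paper is the reference set; the function names, document identifiers and similarity values are hypothetical.

```python
def related_documents(similarities, threshold):
    """Ids of documents whose similarity reaches the threshold.

    ASSUMPTION: 'related' means similarity >= threshold.
    """
    return {doc for doc, score in similarities.items() if score >= threshold}


def external_representativeness(full_paper_sims, summary_sims, threshold):
    """Precision, recall and f-measure of the summary's related set,
    using the full paper's related set as the reference."""
    relevant = related_documents(full_paper_sims, threshold)
    retrieved = related_documents(summary_sims, threshold)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented similarity scores from one paper (and one of its summaries)
# to other papers in a corpus.
full_paper_sims = {"a": 0.95, "b": 0.80, "c": 0.60, "d": 0.30}
summary_sims = {"a": 0.90, "b": 0.55, "c": 0.70, "e": 0.85}

for threshold in (0.5, 0.8):
    p, r, f = external_representativeness(full_paper_sims, summary_sims, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

Sweeping the threshold over a range, as the slides describe, would produce curves of precision, recall and f-measure per summary type.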
