Towards the automatic identification of the nature of citations


Authors cite other publications for varied reasons: to gain assistance in the form of background information, ideas or methods, or to review, critique or refute previous works. The problem is that the best possible way to retrieve the nature of citations is very time-consuming: one must read each article and assign a particular characterisation to each citation. In this work we propose an algorithm, called CiTalO, that automatically infers the function of citations by means of Semantic Web technologies and NLP techniques. We also present some preliminary experiments and discuss some strengths and limitations of this approach.

  • 1. Towards the automatic identification of the nature of citations
      • Angelo Di Iorio – diiorio@cs.unibo.it
      • Andrea Giovanni Nuzzolese – nuzzoles@cs.unibo.it
      • Silvio Peroni –
  • 2. Outline
      • Interpretation of bibliographic citations
      • CiTalO architecture
      • Preliminary empirical evaluation
      • Conclusions
  • 3. Citations as tools
      • Bibliographic citations can be seen as tools for:
          ✦ linking research: making pointers to related works, to sources of experimental data, to methods used, etc.
          ✦ disseminating research: conference proceedings, journals, Web platforms (e.g. blogs, wikis), Semantic Publishing platforms and projects (e.g. OpenCitation, OpenBibliography, Lucero)
          ✦ exploring research: new ways of browsing articles through networks of citations (e.g. CiteWiz, Citation Sensitive In-browser Summariser)
          ✦ evaluating research: measuring the importance of journals (e.g. impact factor) or the scientific productivity of authors (e.g. h-index)
      • This work is based on the assumption that all these activities can be radically improved by exploiting the actual function of citations, i.e. the author's reason for citing a given paper
          ✦ Could a paper that is cited many times negatively be given a high score?
          ✦ Could a paper containing several self-citations be given the same score as a paper with heterogeneous citations?
  • 4. Intended vs. interpreted meaning of citation functions
      • Example citation sentence: “It extends the research outlined in earlier work [3].”
      • Intended meaning (the author of the text): “I want to convey a particular meaning through this text”
      • Interpreted meaning (a reader of the text, another reader of the text): “I recognise a particular meaning by reading that text”
      • These three “meanings” are not the same
      • Quotations from From Proteins to Fairytales: Directions in Semantic Publishing by Anita De Waard, DOI: 10.1109/MIS.2010.49:
          ✦ “Text as a medium through which an author transmits knowledge”
          ✦ “The reader decodes this meaning and literally ‘make sense’ of it”
          ✦ “The meaning is not contained within the words themselves but in the minds of the participants.”
  • 5. setting
      • Example citation sentence: “It extends the research outlined in earlier work [3].”
      • Interpreted meaning (a reader of the text, another reader of the text): “I recognise a particular meaning by reading that text”
      • CiTalO (English translation: cite it) is a tool that recognises the function of citations by exploiting Semantic Web technologies (CiTO, FRED, SPARQL) and NLP techniques (IMS, AlchemyAPI)
  • 6. pipeline
      • Input: a sentence containing a reference to a bibliographic entity indicated by an “X”, e.g. “It extends the research outlined in earlier work X.”
      • Ontology learning: derive a logical (OWL ontology) representation of the sentence through FRED
      • Citation type extraction: extract candidate types for the citation by looking for patterns in FRED output via SPARQL
      • Word-sense disambiguation: gather the sense of the candidate types through IMS with respect to OntoWordNet
      • Sentiment analysis: capture the sentiment polarity emerging from the text through AlchemyAPI
      • Alignment to CiTO: assign CiTO types to the citation through SPARQL CONSTRUCT
      • Output: cito:extends
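The data flow of the pipeline slide can be sketched as a chain of pluggable stages. This is only an illustration of the modular structure: the `Pipeline` class, its field names and the toy stage functions are assumptions made here, and none of them call FRED, IMS or AlchemyAPI. The default property is the one the slides name for the case where no alignment is found.

```python
from dataclasses import dataclass
from typing import Callable, List

# Default returned when no alignment is found (stated in the slides).
DEFAULT_PROPERTY = "cito:citesForInformation"


@dataclass
class Pipeline:
    """Hypothetical container for the stages; each field is a stand-in."""
    extract_candidates: Callable[[str], List[str]]   # ontology learning + type extraction
    disambiguate: Callable[[List[str]], List[str]]   # word-sense disambiguation
    polarity: Callable[[str], str]                   # sentiment analysis
    align: Callable[[List[str], str], List[str]]     # alignment to CiTO

    def run(self, sentence: str) -> List[str]:
        candidates = self.extract_candidates(sentence)
        synsets = self.disambiguate(candidates)
        sentiment = self.polarity(sentence)
        properties = self.align(synsets, sentiment)
        return properties or [DEFAULT_PROPERTY]


# Toy stages, invented for illustration only.
toy = Pipeline(
    extract_candidates=lambda s: ["Extend"] if "extends" in s else [],
    disambiguate=lambda cs: ["own:synset-prolong-verb-1"] if "Extend" in cs else [],
    polarity=lambda s: "neutral",
    align=lambda syns, pol: ["cito:extends"] if "own:synset-prolong-verb-1" in syns else [],
)

print(toy.run("It extends the research outlined in earlier work X."))
print(toy.run("See X."))
```

Keeping each stage behind a plain callable makes it easy to swap one tool for another or to switch a stage off, which is the kind of variation the evaluation slides describe.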
  • 7. Ontology learning and citation type extraction
      • FRED output: it is returned as an OWL ontology in RDF/XML format
      • Candidates extraction: looking for patterns through SPARQL
          ✦ SELECT ?type WHERE { ?subj ?prop fred:X ; a ?typeTmp . ?typeTmp rdfs:subClassOf+ ?type }
          ✦ SELECT ?type WHERE { ?subj ?prop fred:X ; a ?type }
          ✦ SELECT ?type WHERE { ?subj a dul:Event ; boxer:patient ?patient . ?patient a ?type }
          ✦ SELECT ?type WHERE { ?subj a dul:Event , ?type . FILTER(?type != dul:Event) }
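To make the SPARQL patterns above more concrete, here is a minimal sketch that mimics two of them in plain Python over a list of triples. The triples are invented for illustration and are not actual FRED output; the function names are hypothetical, and a real implementation would run the queries on an RDF store.

```python
RDF_TYPE = "rdf:type"  # what "a" abbreviates in SPARQL

# Toy graph, invented for illustration (NOT real FRED output).
triples = [
    ("fred:extend_1", RDF_TYPE, "dul:Event"),
    ("fred:extend_1", RDF_TYPE, "fred:Extend"),
    ("fred:extend_1", "boxer:patient", "fred:research_1"),
    ("fred:research_1", RDF_TYPE, "fred:Research"),
    ("fred:research_1", "fred:outlinedIn", "fred:X"),
]


def types_of_subjects_linked_to(graph, target):
    """Mimics: SELECT ?type WHERE { ?subj ?prop <target> ; a ?type }"""
    subjects = {s for s, _, o in graph if o == target}
    return sorted({o for s, p, o in graph if s in subjects and p == RDF_TYPE})


def types_of_event_patients(graph, event_class="dul:Event"):
    """Mimics: SELECT ?type WHERE { ?subj a dul:Event ;
    boxer:patient ?patient . ?patient a ?type }"""
    events = {s for s, p, o in graph if p == RDF_TYPE and o == event_class}
    patients = {o for s, p, o in graph if s in events and p == "boxer:patient"}
    return sorted({o for s, p, o in graph if s in patients and p == RDF_TYPE})


print(types_of_subjects_linked_to(triples, "fred:X"))
print(types_of_event_patients(triples))
```

On this toy graph both patterns return the type of the node attached to the citation or to the event, which is the kind of candidate the next stage disambiguates.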
  • 8. Word-sense disambiguation and alignment to CiTO
      • We use IMS (a word-sense disambiguator) to disambiguate the candidate types found, with respect to OntoWordNet (OWN)
          ✦ EarlierWork and Work: own:synset-work-noun-1
          ✦ Extend: own:synset-prolong-verb-1
          ✦ Outline: own:synset-delineate-verb-3
          ✦ Research: own:synset-research-noun-1
      • These are the WordNet synsets retrieved by IMS; the set of retrieved synsets can be extended automatically by adding the proximal ones
      • CiTO is an ontology that defines functions of citations as object properties
      • CiTO2Wordnet is an ontology we developed to map all the CiTO properties defining citations to related WordNet synsets
      • Finally we perform a SPARQL CONSTRUCT to align the synsets to those defined in the CiTO2Wordnet ontology, which imports CiTO
          ✦ CiTO properties whose polarity (defined in the CiTOFunctions ontology) is opposite to the one returned by the sentiment-analysis module are not considered
          ✦ If no alignment is performed, cito:citesForInformation is returned as default
      • Example: cito:extends skos:closeMatch own:synset-prolong-verb-1 . Hence cito:extends is selected!
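The alignment rules on this slide (close-match lookup, polarity filter, default property) can be sketched as follows. Only the pair cito:extends / own:synset-prolong-verb-1 and the default cito:citesForInformation come from the slides; the `ex:` entries and every polarity value are placeholders assumed for illustration, since the real values live in CiTO2Wordnet and CiTOFunctions.

```python
from typing import Dict, List, Optional

DEFAULT_PROPERTY = "cito:citesForInformation"

# Synset -> CiTO property (skos:closeMatch). First pair is from the slides;
# the "ex:" pair is a hypothetical placeholder.
CLOSE_MATCH: Dict[str, str] = {
    "own:synset-prolong-verb-1": "cito:extends",
    "ex:synset-a": "ex:negativeProperty",
}

# Polarity per property: ASSUMED values, for illustration only.
POLARITY: Dict[str, str] = {
    "cito:extends": "positive",
    "ex:negativeProperty": "negative",
}

OPPOSITE = {"positive": "negative", "negative": "positive"}


def align(synsets: List[str], sentiment: Optional[str] = None) -> List[str]:
    """Map synsets to properties, dropping those whose polarity is opposite
    to the detected sentiment. sentiment=None means the filter is off."""
    blocked = OPPOSITE.get(sentiment)  # None for neutral or deactivated
    selected: List[str] = []
    for synset in synsets:
        prop = CLOSE_MATCH.get(synset)
        if prop is None:
            continue
        if blocked is not None and POLARITY.get(prop) == blocked:
            continue
        if prop not in selected:
            selected.append(prop)
    return selected or [DEFAULT_PROPERTY]


print(align(["own:synset-work-noun-1", "own:synset-prolong-verb-1"], "positive"))
print(align(["own:synset-prolong-verb-1"], "negative"))
```

Passing `sentiment=None` corresponds to running without the sentiment-analysis module, so no property is filtered out by polarity.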
  • 9. A preliminary empirical evaluation
      • Comparing the results of CiTalO with a human classification of citations
      • The test bed we used for our experiments includes some scientific papers (written in English) encoded in DocBook, containing citations of different types
      • We automatically extracted citation sentences, through an XSLT document, from all the papers published in the 7th volume of Balisage Proceedings
          ✦ 18 scientific papers written by different authors
          ✦ 377 citations (20.94 citations per paper)
      • We filtered all the citation sentences from the selected articles
          ✦ The annotation of citation functions is a hard problem to address
          ✦ We kept only 106 citations, which were accompanied by verbs and/or other grammatical structures explicitly carrying a particular citation function, as suggested by Teufel et al. [18]
      • We marked (according to CiTO properties) these 106 citations, obtaining at least one representative citation for each of the 18 papers used (5.89 citations per paper)
      • We used 21 CiTO properties out of 38 to annotate all these citations
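The per-paper averages quoted on this slide follow directly from the counts. The short check below recomputes them; the two percentages are derived here from the same counts and do not appear in the slides.

```python
papers = 18
extracted_citations = 377
kept_citations = 106
cito_used, cito_total = 21, 38

# Averages reported on the slide.
extracted_per_paper = round(extracted_citations / papers, 2)
kept_per_paper = round(kept_citations / papers, 2)

# Derived shares (not stated in the slides).
kept_share = round(100 * kept_citations / extracted_citations, 1)
property_share = round(100 * cito_used / cito_total, 1)

print(extracted_per_paper, kept_per_paper)  # 20.94 5.89
print(kept_share, property_share)           # 28.1 55.3
```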
  • 10. CiTalO setting
      • We ran CiTalO on these 106 citation sentences and compared the results with our annotations
      • We also tested eight different configurations of CiTalO, corresponding to all possible combinations of three options:
          ✦ activating or deactivating the sentiment-analysis module;
          ✦ applying or not the proximal synsets to the IMS output;
          ✦ using an extended version of CiTO2Wordnet that includes all the synsets WordNet retrieves for a particular string (including those synsets whose gloss is radically different from the CiTO property under consideration)
      • Consider the synset own:synset-credit-verb-3, having gloss “accounting: enter as credit”, and the CiTO definition for cito:credits, “the citing entity acknowledges contributions made by the cited entity”
          ✦ CiTO2Wordnet does not link cito:extends to that synset
          ✦ CiTO2Wordnet – extended does link the synset with cito:extends
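Three on/off options give 2³ = 8 configurations. A minimal sketch of enumerating them is shown below; the option names are labels chosen here for readability, not identifiers from CiTalO.

```python
from itertools import product

# Hypothetical labels for the three options listed on the slide.
OPTIONS = ("sentiment_analysis", "proximal_synsets", "extended_cito2wordnet")

# Every on/off combination of the three options.
configurations = [
    dict(zip(OPTIONS, flags))
    for flags in product((False, True), repeat=len(OPTIONS))
]

for config in configurations:
    print(config)
```

Each dictionary can then be passed to a single run over the 106 citation sentences, so all configurations are evaluated on the same data.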
  • 11. Results
      • Figure: kinds and number of citations marked by us
          ✦ Similarly to Teufel et al. [19], the most neutral CiTO property, citesForInformation, was the most prevalent function in our dataset too, while the second most used property was usedMethodIn
      • Figure: precision and recall of CiTalO according to different configurations
          ✦ No configuration emerges as the absolute best from these data
          ✦ The worst configurations were those that took into account all the proximal synsets
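The slides do not spell out how precision and recall were computed. As a rough sketch, assuming the usual set-based definitions micro-averaged over citations (a tool may return several properties for one citation), they could be computed as below. The predictions and annotations are invented toy data, not results from the evaluation.

```python
from typing import List, Set, Tuple


def precision_recall(predicted: List[Set[str]], gold: List[Set[str]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-citation label sets."""
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
predicted = [
    {"cito:extends"},
    {"cito:citesForInformation", "cito:extends"},
    {"cito:citesForInformation"},
]
gold = [
    {"cito:extends"},
    {"cito:citesForInformation"},
    {"cito:credits"},
]

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Under these definitions, returning extra properties (for example via proximal synsets) tends to lower precision unless the added properties are correct.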
  • 12. Conclusions
      • The implementation of CiTalO is still at an early stage
      • Current experiments are not enough to fully validate this approach
      • However, the goal of this work was to build a modular architecture, to perform some exploratory experiments and to identify issues and possible developments of our approach
      • Future work:
          ✦ Propose extensions to CiTO to cover more scenarios (e.g. cito:speculatesOn and cito:citesAsPotentialSolution)
          ✦ Decrease the noise introduced by proximal synsets
          ✦ Identify anaphoras (e.g. nouns, pronouns) referring to the citations
          ✦ Perform exhaustive tests with a larger set of documents and users
  • 13. Thanks for your attention
      • Please come to see CiTalO in action during the ESWC Demo Session on Tuesday