How Scientists Read, How Computers Read, and What We Should Do

Talk at ALPSP 2012 in the text mining session

Published in: Technology, Health & Medicine

1 Comment
  • Interesting peek into a future of science publishing, Anita! I learned that scientists want to 'ingest' knowledge, which gives publishers a compelling reason and direction to innovate and provide the right metaknowledge and context with their publications. Yes, I think your approach may bring unique and useful results in the context of discovery. Big issues remain: standardisation of the underlying articles and keeping experimental data and metadata strictly separated. For me, the basic scientific article needs to keep its role as the ultimate verifiable source of any scientific statement.

    But there is room for improvement in how they are written. Unambiguous semantics would provide great fuel for your approach. So, I hope for your next talk you are able to add 'How Scientists Write' to the title.

    Keep up the good work!

    Mark Eligh

Transcript

  • 1. How Scientists Read, How Computers Read, and What We Should Do (= not what it says in the abstract!) Anita de Waard, Disruptive Technologies Director, Elsevier Labs
  • 2. Outline: 1. How do scientists read? 2. How do computers read? 3. What should we do?
  • 4. How we read: Letter < syllable < word < clause < sentence < discourse. This is how linguistics is structured. But it is not how we understand text!
  • 10. Scientists read:
    • Why do scientists read?
      – They want to ingest knowledge:
      – read, integrate with their current knowledge
    • What do scientists read?
      – Things that are ‘interesting’:
      – Pertinent (within their ‘shell of interest’)
      – Possibly or probably true
      – Novel, but in agreement with what we know
  • 11. What is this paper about? NOUN PHRASES
    transiently expressed miRNA sponges human breast cancer high-grade malignancy miR-31 noninvasive MCF7-Ras antisense oligonucleotides cell viability cloned retroviral vector
    Is it pertinent? -> Possibly…
    Is it true? -> ?
    Is it new, but in agreement with what I know? -> ?
  • 12. What is this paper about? TRIPLES
    miR-31 expression DEPRIVE metastatic cells
    miR-31 PREVENT acquisition of aggressive traits
    miR-31 INHIBIT noninvasive MCF7-Ras cells
    miR-31 ENHANCE invasion
    cell viability AFFECT inhibitor
    Is it pertinent? -> Possibly…
    Is it true? -> ?
    Is it new, but in agreement with what I know? -> ?
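The triples on slide 12 can be held as plain subject-predicate-object tuples. The following is a minimal Python sketch under that assumption; the `Triple` type and the `about` helper are hypothetical names chosen for illustration and do not come from any tool mentioned in the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement extracted from text."""
    subject: str
    predicate: str
    obj: str


# The five triples listed on slide 12, copied as they appear there.
TRIPLES = [
    Triple("miR-31 expression", "DEPRIVE", "metastatic cells"),
    Triple("miR-31", "PREVENT", "acquisition of aggressive traits"),
    Triple("miR-31", "INHIBIT", "noninvasive MCF7-Ras cells"),
    Triple("miR-31", "ENHANCE", "invasion"),
    Triple("cell viability", "AFFECT", "inhibitor"),
]


def about(entity: str, triples: list[Triple]) -> list[Triple]:
    """Return every triple whose subject or object mentions the entity."""
    needle = entity.lower()
    return [
        t for t in triples
        if needle in t.subject.lower() or needle in t.obj.lower()
    ]


if __name__ == "__main__":
    for t in about("miR-31", TRIPLES):
        print(t.subject, t.predicate, t.obj)
```

A bare tuple records no hedging, method, or source. "miR-31 ENHANCE invasion", for example, drops the fact that it was suppression of miR-31 that enhanced invasion (slide 14), which is why the slide answers "Is it true?" with a question mark.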
  • 13. What is this paper about? METADISCOURSE
    The preceding observations demonstrated that X expression deprives Y cells of attributes associated with Z.
    We next asked whether X also prevents the acquisition of A traits by B cells.
    To do so, we transiently inhibited X in C cells with either D or E.
    Both approaches inhibited X function by > 4.5-fold (Figure S7A).
    Suppression of X enhanced invasion by 20-fold and motility by 5-fold, but F was unaffected by either inhibitor (Figure 3A; Figure S7B).
    The E sponge reduced X function by 2.5-fold, but did not affect the activity of other known Js (Figures S8A and S8B).
    Collectively, these data indicated that sustained X activity is necessary to prevent the acquisition of Z traits by both K and untransformed B cells.
    Is it pertinent? -> Need content
    Is it true? -> Sounds likely! I know this stuff!
    Is it new, but in agreement with what I know? -> Need content
  • 14. What is this paper about? CLAIMS AND EVIDENCE
    Claim:
    • sustained miR-31 activity is necessary to prevent the acquisition of aggressive traits by both tumor cells and untransformed breast epithelial
    Evidence: Method:
    • We transiently inhibited miR-31 in noninvasive MCF7-Ras cells with either antisense oligonucleotides or miRNA sponges.
    Evidence: Result:
    • Both approaches inhibited miR-31 function by >4.5-fold (Figure S7A).
    • Suppression of miR-31 enhanced invasion by 20-fold and motility by 5-fold, but cell viability was unaffected by either inhibitor (Figure 3A; Figure S7B).
    • The miR-31 sponge reduced miR-31 function by 2.5-fold, but did not affect the activity of other known antimetastatic miRNAs (Figures S8A and S8B).
    Is it pertinent? -> Probably
    Is it true? -> Sounds likely!
    Is it new, but in agreement with what I know? -> Check/know
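One way to picture the claim-and-evidence view on slide 14 is as a small record structure: a claim that carries its method and result statements with it. The sketch below is only an illustration. The class and field names are hypothetical, and the claim text is abridged from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    kind: str  # "Method" or "Result", the two labels used on the slide
    text: str


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

    def results(self) -> list[Evidence]:
        """Return only the result statements that back this claim."""
        return [e for e in self.evidence if e.kind == "Result"]


# Content taken from slide 14; the claim text is abridged.
claim = Claim(
    text=("sustained miR-31 activity is necessary to prevent the "
          "acquisition of aggressive traits"),
    evidence=[
        Evidence("Method", "We transiently inhibited miR-31 in noninvasive "
                           "MCF7-Ras cells with either antisense "
                           "oligonucleotides or miRNA sponges."),
        Evidence("Result", "Both approaches inhibited miR-31 function by "
                           ">4.5-fold (Figure S7A)."),
        Evidence("Result", "Suppression of miR-31 enhanced invasion by "
                           "20-fold and motility by 5-fold, but cell "
                           "viability was unaffected by either inhibitor "
                           "(Figure 3A; Figure S7B)."),
        Evidence("Result", "The miR-31 sponge reduced miR-31 function by "
                           "2.5-fold, but did not affect the activity of "
                           "other known antimetastatic miRNAs "
                           "(Figures S8A and S8B)."),
    ],
)

if __name__ == "__main__":
    print(claim.text)
    for item in claim.evidence:
        print(f"  [{item.kind}] {item.text}")
```

Keeping the evidence attached to the claim lets a reader, or a program, move from the statement to the method and figures behind it.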
  • 15. What is this paper about? DATA
    Is it pertinent? -> Need content
    Is it true? -> Need methods
    Is it new, but in agreement with what I know? -> Check/know
  • 16. What is this paper about? METADATA
    Is it pertinent? -> Possibly
    Is it true? -> Probably!
    Is it new, but in agreement with what I know? -> Need background
  • 17. How scientists read:
    Representation / Pertinence / Truth / Fit with knowledge
    Noun phrases: x
    Triples: x
    Metadiscourse: x
    Claims and evidence: x x x
    Data: x x x
    Metadata: x
    Text mining / Publishing / Data-centric science
  • 18. Outline: 1. How do scientists read? 2. How do computers read? 3. What should we do?
  • 19. Noun Phrases: some issues
    • Problem 1: disambiguating terms (© GoPubMed):
      – Hnrpa1 = Tis = Fli-2 = nuclear ribonucleoprotein A1 = helix destabilizing protein = single-strand binding protein = hnRNP core protein A1 = HDP-1 = topoisomerase-inhibitor suppressed.
      – Cellulose 1,4-beta-cellobiosidase = exoglucanase
      – COLD ≠ C.O.L.D. ≠ cold (runny nose) ≠ cold (low T)
    • Problem 2: disambiguating entities (© M. Martone):
      – 95 antibodies were (manually!) identified in 8 articles
      – 52 did not contain enough information to determine the antibody used
      – Some provided details in other papers
      – Failed to give species, clonality, vendor, or catalog number
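Problem 1 on slide 19 is, at its simplest, a lookup from many surface forms to one canonical name. The sketch below illustrates that idea with the synonym groups given on the slide. Taking the first term of each group as the canonical name is an assumption made here, and `normalize` is a hypothetical helper, not GoPubMed's method.

```python
from typing import Optional

# Synonym groups copied from slide 19. Using the first term of each
# group as the canonical name is an assumption for illustration only.
SYNONYM_GROUPS = [
    [
        "Hnrpa1", "Tis", "Fli-2", "nuclear ribonucleoprotein A1",
        "helix destabilizing protein", "single-strand binding protein",
        "hnRNP core protein A1", "HDP-1",
        "topoisomerase-inhibitor suppressed",
    ],
    ["Cellulose 1,4-beta-cellobiosidase", "exoglucanase"],
]

# Flatten the groups into a surface-form -> canonical-name table.
CANONICAL = {term: group[0] for group in SYNONYM_GROUPS for term in group}


def normalize(term: str) -> Optional[str]:
    """Map a surface form to its canonical name, or None if unknown.

    The lookup is deliberately case-sensitive: on the slide, COLD,
    C.O.L.D. and cold are different things, so folding case would
    merge terms that must stay apart.
    """
    return CANONICAL.get(term)


if __name__ == "__main__":
    for surface in ("HDP-1", "exoglucanase", "hdp-1"):
        print(surface, "->", normalize(surface))
```

A table like this handles synonyms but not the COLD/cold case, where the same string means different things in different contexts. That case needs the surrounding text, which is the harder part of the problem.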
  • 20. Noun Phrases: some progress
    • Despite these difficulties, noun phrase recall/precision is quite high, e.g. I2B2 2011 [1], [2], others: 90%-98%
    • Many tools, see [3] for a list; e.g. GoPubMed:
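For readers unfamiliar with the recall/precision figures quoted on slide 20, the sketch below shows how the two measures are computed for a set of extracted noun phrases. The `gold` and `predicted` sets are toy data invented for illustration (the phrases are borrowed from slide 11); they are not results from any of the cited evaluations.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Compute precision and recall of predicted phrases against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    "miR-31",
    "cell viability",
    "antisense oligonucleotides",
    "human breast cancer",
}
predicted = {
    "miR-31",
    "cell viability",
    "antisense oligonucleotides",
    "Figure S7A",  # a spurious extraction
}

if __name__ == "__main__":
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Precision is the share of extracted phrases that are correct; recall is the share of correct phrases that were extracted.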
  • 21. Triples: some issues:
    • Contingent on good NP & VP detection
    • Hard to parse text! E.g. a commercial tool gave:
      – insulin maintaining glucose homeostasis
        When insulin secretion cannot be increased adequately (type I diabetes defect) to overcome insulin resistance in maintaining glucose homeostasis, hyperglycemia and glucose intolerance ensues.
      – insulin may be involved glucose homeostasis
        Because PANDER is expressed by pancreatic beta-cells and in response to glucose in a similar way to those of insulin, PANDER may be involved in glucose homeostasis.
  • 22. Triples: some progress:
    Biological Expression Language [4]:
    We provide evidence that these miRNAs are potential novel oncogenes participating in the development of human testicular germ cell tumors by numbing the p53 pathway, thus allowing tumorigenic growth in the presence of wild-type p53.
    Increased abundance of miR-372 decreases activity of TP53:
      r(MIR:miR-372) -| tscript(p(HUGO:Trp53))
    Context: cancer:
      SET Disease = "Cancer"
    Activity of TP53 decreases cell growth:
      tscript(p(HUGO:Trp53)) -| bp(GO:"Cell Growth")
  • 23. Use biological pathway visualizations as a user interface for knowledge discovery.
  • 24. Author-created triples: MSR ActiveText
  • 25. Metadiscourse: why it matters: “[Y]ou can transform .. fiction into fact just by adding or subtracting references”, Bruno Latour [5]
    • Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.”
    • Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).”
    • Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).”
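The three quotations on slide 25 show a hedge ("possibly") surviving one citation and disappearing in the next. A crude way to make that visible is to scan each sentence for hedging cues, as in the sketch below. The cue list is a tiny hypothetical sample chosen for this example, not a validated lexicon, and simple keyword matching is a stand-in for the much richer epistemic analysis discussed in the talk.

```python
import re

# Hypothetical, deliberately small list of hedging cues.
HEDGES = ("possibly", "probably", "may", "might", "could")
HEDGE_RE = re.compile(r"\b(?:" + "|".join(HEDGES) + r")\b", re.IGNORECASE)

# The three statements quoted on slide 25, keyed by source.
STATEMENTS = {
    "Voorhoeve et al., 2006": (
        "These miRNAs neutralize p53-mediated CDK inhibition, possibly "
        "through direct inhibition of the expression of the tumor "
        "suppressor LATS2."
    ),
    "Kloosterman and Plasterk, 2006": (
        "In a genetic screen, miR-372 and miR-373 were found to allow "
        "proliferation of primary human cells that express oncogenic RAS "
        "and active p53, possibly by inhibiting the tumor suppressor LATS2 "
        "(Voorhoeve et al., 2006)."
    ),
    "Okada et al., 2011": (
        "Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the "
        "expression of Lats2, thereby allowing tumorigenic growth in the "
        "presence of p53 (Voorhoeve et al., 2006)."
    ),
}


def hedges(sentence: str) -> list[str]:
    """Return the hedging cues found in a sentence, lower-cased."""
    return [m.group(0).lower() for m in HEDGE_RE.finditer(sentence)]


if __name__ == "__main__":
    for source, sentence in STATEMENTS.items():
        print(source, "->", hedges(sentence) or "no hedge found")
```

With this cue list, the first two statements are flagged as hedged and the third is not, even though it cites the same original paper.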
  • 26. Adding metadiscourse to triples:
    Biological statement with BEL/epistemic markup:
      These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.
    BEL representation:
      r(MIR:miR-372) -| (tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family")))
      Increased abundance of miR-372 decreases abundance of LATS2
      r(MIR:miR-372) -| r(HUGO:LATS2)
    Epistemic evaluation: Value = Possible; Source = Unknown; Basis = Unknown
    Biological statement with MedScan/epistemic markup:
      Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).
    MedScan analysis:
      IL-6  NUCB2 (nesfatin-1)
      Relation: MolTransport
      Effect: Positive
      CellType: Adipocytes
      Cell Line: 3T3-L1
    Epistemic evaluation: Value = Probable; Source = Author; Basis = Data
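The epistemic evaluation on slide 26 attaches three fields (Value, Source, Basis) to a formal statement. A minimal sketch of such a record follows. The class name, the free-text `statement` field, and the `backed_by_data` filter are hypothetical; only the field values are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpistemicStatement:
    statement: str  # formal or semi-formal representation of the relation
    value: str      # e.g. "Possible", "Probable"
    source: str     # e.g. "Author", "Unknown"
    basis: str      # e.g. "Data", "Unknown"


# The two examples from slide 26. The second statement string is a
# free-text summary of the MedScan fields, not MedScan's own notation.
STATEMENTS = [
    EpistemicStatement(
        statement="r(MIR:miR-372) -| r(HUGO:LATS2)",
        value="Possible",
        source="Unknown",
        basis="Unknown",
    ),
    EpistemicStatement(
        statement=("IL-6, NUCB2 (nesfatin-1); Relation: MolTransport; "
                   "Effect: Positive"),
        value="Probable",
        source="Author",
        basis="Data",
    ),
]


def backed_by_data(statements: list[EpistemicStatement]) -> list[EpistemicStatement]:
    """Keep only the statements whose stated basis is experimental data."""
    return [s for s in statements if s.basis == "Data"]


if __name__ == "__main__":
    for s in backed_by_data(STATEMENTS):
        print(s.value, "|", s.source, "|", s.statement)
```

Carrying these fields alongside the triple lets downstream tools tell a hedged hypothesis from a data-backed finding, instead of treating every extracted relation as fact.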
  • 27. Claims and Evidence, some examples: Data2Semantics [11]
    • Linking clinical guidelines to evidence in a linked data form
    • Goal: improve speed of integration of research > practice
    • Issue: evidence is not even correct within guideline?
      • Studies have demonstrated inconsistent results regarding the use of such markers of inflammation as C-reactive protein (CRP), interleukins-6 (IL-6) and -8, and procalcitonin (PCT) in neutropenic patients with cancer [55–57].
      • [55]: PCT and IL-6 are more reliable markers than CRP for predicting bacteremia in patients with febrile neutropenia
      • [56] In conclusion, daily measurement of PCT or IL-6 could help identify neutropenic patients with a stable course when the fever lasts >3 d. …, it would reduce adverse events and treatment costs.
      • [57] Our study supports the value of PCT as a reliable tool to predict clinical outcome in febrile neutropenia.
  • 28. Claims and Evidence, example: Drug Interaction Knowledgebase [12]
    • Extracting adverse drug interactions (ADIs) from literature and creating linked data node of this
    • Goal: improve speed and coverage of ADIs and allowing improved access to patients and doctors
    • Issue: how to identify evidence?
      – Claim: R-citalopram_is_not_substrate_of_cyp2c19:
      – Evidence: At 10uM R- or S-CT, ketoconazole reduced reaction velocity to 55 - 60% of control, quinidine to 80%, and omeprazole to 80-85% of control (Fig. 6)
  • 29. Data, e.g. Web Science 2.0: Mark Wilkinson (SADI, Madrid). Using what is known about interactions in fly & yeast: predict new interactions with a human protein
  • 30. Wilkinson: doing science ON the web: These are different Web services! ...selected at run-time based on the same model
  • 31. Data
    • All this evidence is based on data
    • Increasingly: science is distributed between
      – Groups creating data
      – Groups using data – creating tools
      – Groups using tools on data – ideas
    • All of these groups need to communicate!
  • 32. In summary: 1. How do scientists read? 2. How do computers read? 3. What should we do?
  • 33. How we read vs. computers:
    Level:                People read:                Computers read:
    Noun phrases          Know topic                  Pretty well
    Triples               Know topic                  Pretty well
    Metadiscourse         Trust method                Not very well
    Claims and evidence   Understand and trust        Not very well
    Data                  Trust - and new science!    Can enable!
  • 34. Is this the future of publishing? [17]
    1. Research: Each item in the system has metadata (including provenance) and relations to other data items added to it.
    2. Workflow: All data items created in the lab are added to a (lab-owned) workflow system.
    3. Authoring: A paper is written in an authoring tool which can pull data with provenance from the workflow tool in the appropriate representation into the document.
    4. Editing and review: Once the co-authors agree, the paper is 'exposed' to the editors, who in turn expose it to reviewers. Reports are stored in the authoring/editing system, the paper gets updated, until it is validated.
    5. Publishing and distribution: When a paper is published, a collection of validated information is exposed to the world. It remains connected to its related data item, and its heritage can be traced.
    6. User applications: distributed applications run on this 'exposed data' universe.
    [Figure labels: metadata; Review, Revise, Edit; Publisher runs service ('app'). Sample text in figure: "Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-"]
  • 35. What should we do?
    • Experiment! All over the place. Scientists get it!
    • Support scientists working on these (e.g. text miners, web science evangelists, data repositories, etc.): great return for your investment!
    • Join forums where interactions happen between scientists, publishers, libraries, etc., e.g. Force11.org:
      – Collective, sponsored by Sloan, aimed at enabling/supporting this discussion
      – Planning workshop, innovative projects for 2013
      – Please join us at http://force11.org!
  • 36. Thank you! Anita de Waard, a.dewaard@elsevier.com, http://elsatglabs.com/labs/anita/
  • 37. References
    [1] J Am Med Inform Assoc. 2010 September; 17(5): 514–518. http://dx.doi.org/10.1136/jamia.2010.003947
    [2] Quanzhi Li, Yi-Fang Brook Wu (2006): Identifying important concepts from medical documents, Journal of Biomedical Informatics 39 (2006) 668–679.
    [3] Useful list of resources in bioinformatics: http://www.bioinformatics.ca/
    [4] Biological Expression Language: http://www.openbel.org
    [5] Latour, B. and Woolgar, S., Laboratory Life: the Social Construction of Scientific Facts, 1979, Sage Publications.
    [6] Light M, Qiu XY, Srinivasan P. (2004). The language of bioscience: facts, speculations, and statements in between. BioLINK 2004: Linking Biological Literature, Ontologies and Databases 2004:17-24.
    [7] Wilbur WJ, Rzhetsky A, Shatkay H (2006). New directions in biomedical text annotations: definitions, guidelines and corpus construction. BMC Bioinformatics 2006, 7:356.
    [8] Thompson P., Venturi G., McNaught J, Montemagni S, Ananiadou S. (2008). Categorising modality in biomedical texts. Proc. LREC 2008 Wkshp Building and Evaluating Resources for Biomedical Text Mining 2008.
    [9] Kim, S-M. Hovy, E.H. (2004). Determining the Sentiment of Opinions. Proceedings of the COLING conference, Geneva, 2004.
    [10] de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
    [11] Data2Semantics project: http://www.data2semantics.org/
    [12] Boyce R, Collins C, Horn J, Kalet I. (2009). Computing with evidence Part I: A drug-mechanism evidence taxonomy oriented toward confidence assignment. J Biomed Inform. 2009 Dec;42(6):979-89. Epub 2009 May 10, see also http://dbmi-icode-01.dbmi.pitt.edu/dikb-evidence/front-page.html
    [13] Sándor, Àgnes and de Waard, Anita (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles, Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
    [14] Blake, C. (2010). Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles, Journal of Biomedical Informatics, 43(2):173-189.
    [15] See e.g. http://ucsdbiolit.codeplex.com/ and http://research.microsoft.com/en-us/projects/ontology/ for MS Word ontology add-ins.
    [16] de Waard, A. and Schneider, J. (2012). Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA), Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012.
    [17] de Waard, A. (2010). The Future of the Journal? Integrating research data with scientific discourse, LOGOS: The Journal of the World Book Community, Volume 21, Numbers 1-2, 2010, pp. 7-11(5); also published in Nature Precedings, http://precedings.nature.com/documents/4742/version/1