Publishing for the 21st Century: Experiences from the NEUROSCIENCE INFORMATION FRAMEWORK

Maryann Martone

Making Sense of Biological Systems: Using Knowledge Mining to Improve and Validate Models of Living Systems; NIH COBRE Center for the Analysis of Cellular Mechanisms and Systems Biology, Montana State University, Bozeman, MT

August 24, 2012

  1. Maryann E. Martone, Ph.D., Neuroscience Information Framework, University of California, San Diego
  2. Themes
     • Computers are now partners with humans in reading the literature: search, summarization, linking, discovery
     • The scientific paper starts with the materials and methods
       ○ All observations, claims, etc. flow from experimental design and materials
       ○ If authors do not provide this information in the first place, then we can’t use it to improve all of the above
     • Scientists produce articles for each other, not for computers
       ○ Not everything you need to interpret the paper is in the paper
       ○ More information may be there than is in the text
  3. NIF is an initiative of the NIH Blueprint consortium of institutes
     • What types of resources (data, tools, materials, services) are available to the neuroscience community?
     • How many are there?
     • What domains do they cover? What domains do they not cover?
     • Where are they?
       ○ Web sites
       ○ Databases
       ○ Literature
       ○ Supplementary material
       ○ PDF files
       ○ Desk drawers
     • Who uses them?
     • Who creates them?
     • How can we find them?
     • How can we make them better in the future?
     • http://neuinfo.org
     • NIF provides a wealth of practical information on data and resource issues in neuroscience
  4. The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
     • A portal for finding and using neuroscience resources
     • A consistent framework for describing resources
     • Provides simultaneous search of multiple types of information, organized by category
     • Supported by an expansive ontology for neuroscience
     • Utilizes advanced technologies to search the “hidden web”
     • http://neuinfo.org
     • UCSD, Yale, Cal Tech, George Mason, Washington Univ; supported by NIH Blueprint
     • Literature: 22 mil; Data Federation: 350 mil; Resource Registry: 5000
  5. In an ideal information system, we would be able to find…
     • What is known
       ○ “What studies used my monoclonal mouse antibody against actin in humans?”
       ○ “What phenotypes are associated with each mouse model of Spinal Muscular Atrophy?”
       ○ “What upregulates SMN1?”
     • What is not known
       ○ Connect information to infer plausible hypotheses: genotype-phenotype, possible drug targets
       ○ Information gaps
  6. Whither biological information?
     • ∞
     • What is easily machine processable and accessible
     • What is potentially knowable
     • What is known: literature, images, human knowledge
  7. CA2: Ion, Brain Part or Gene?
     • BioGrid
     • Allen Brain Atlas
     • Brain Info
     • NIF queries across 170+ independent databases
  8. Papers are the currency of science
     • Despite the wealth of data out there (> 2500 databases on-line), the majority of data is still published in papers
     • But... we write for other humans to consume, and information continues to be hard to find
     • Even for humans, however, it is difficult to find and verify basic information about a paper critical for interpretation
       ○ What is the subject of the study
       ○ What reagents were used
       ○ What genes were studied
     • A lot of information is missing from papers
       ○ Not all data is available
       ○ Data is published in papers in forms that are difficult to use
  9. Mining the literature for resources
     • Resources: materials, services, tools, data
     • Project 1: Find materials: antibodies and transgenic animals
     • Project 2: Mine supplemental data in papers showing gene expression changes in drug abuse
     • Purpose
       ○ Find new resources
       ○ Track usage of existing resources
       ○ Link resources to other useful information
  10. Linking resources: Link out broker
  11. Use case: antibodies
     • Pilot project to use text mining to identify antibodies used in studies
       ○ Wanted to pick a project that would be immediately understandable by research scientists
     • Antibodies are used routinely to identify proteins and other molecules in basic and translational studies
     • Antibodies are a large source of experimental variability in results
       ○ Same antibody can give you very different results
       ○ Different antibodies to the same protein can give you very different results
     • Neuroscientists spend a lot of time tracking down antibodies and troubleshooting experiments that use antibodies
  12. Our reagents and methods are not perfect
     • “We note that many of the findings in the literature about neuronal NF-κB are based on data garnered with antibodies that are not selective for the NF-κB subunit proteins p65 and p50. The data urge caution in interpreting studies of neuronal NF-κB activity in the brain.” (Herkenham et al., J Neuroinflammation. 2011; 8: 141.)
  13. Antibodies are complex entities
     • Anti-ChAT antibody
       ○ Raised against a portion of choline acetyltransferase
       ○ Raised in a particular species
       ○ Is polyclonal or monoclonal
       ○ Is affinity purified or not
       ○ Recognizes the target in some species, e.g., human
     • Reported in materials and methods: Tissue sections were blocked with 5% serum and incubated overnight at 4 °C with the following primary antibodies: anti-ChAT (1:100; Millipore, Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell Signaling), anti-neurofilament 200 kDa (1:200; Millipore) ...
  14. “Find studies that used a rabbit polyclonal antibody against GFAP that recognizes human in immunocytochemistry”
     • Paz et al., J Neurosci, 2010 (AB_310775)
     • NIF Antibody Registry: database of > 900,000 antibodies
  15. Searching for resources in literature
     • NIF recently implemented a section-specific search
     • Semi-automated resource identification pipeline
     • Paul Sternberg, Yuling Li, Cal Tech
  16. Annotation of antibodies
     • Allows annotation of entities and key relationships:
       ○ Protocol
       ○ Subject of protocol
     • Links antibodies to a database of antibodies that contains their properties
       ○ NIF Antibody Registry
       ○ 900,000 antibodies
       ○ Unique ID
     • http://antibodyregistry.org
     • http://annotationframework.org/
     • DOMEO annotation tool: Paolo Ciccarese; Tim Clark, MGH
  17. What studies used my monoclonal mouse antibody against actin in humans?
     • Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9) and from subjects with AD (n = 11) were obtained from the Rapid Autopsy Program
     • Immunoblot analysis and antibodies: The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)…
     • Subject is human
     • mAb = monoclonal antibody
  18. Tracking down reagents
     • Feng et al., MATH5 controls the acquisition of multiple retinal cell fates, Mol Brain. 2010; 3: 36
  19. Space limitations
     • Content gets separated in space and time
     • Practices are designed to save space, improve readability and save authors typing
     • But... electrons are cheap
     • Cut and paste is cheap
       ○ Re-examining plagiarism in the age of cut and paste
     • Autocomplete is cheap
       ○ Acronyms and abbreviations
       ○ Are there any unique 3 letter strings?
     • Formats are flexible
       ○ What the computer sees and what humans see don’t have to be the same thing
  20. Try this Watson!
     • 95 antibodies were identified in 8 articles
     • 52 did not contain enough information to determine the antibody used
       ○ Some provided details in another paper
       ○ And another paper, and another...
       ○ Failed to give species, clonality, vendor, or catalog number
     • But many provided the location of the vendor because the instructions to authors said to do so
  21. Subject of study
     • Often not explicit: “patients with AD” = human
     • Type III SMA mice (Smn−/−, SMN2+/−) were produced as previously described (Tsai et al., 2006a).
     • Official strain nomenclature of animals not designed for search
       ○ SMN2Ahmb89tg/tg;SMNΔ7tg/tg:Smn1−/−; no unique identifier assigned
     • Many lines of transgenics are generated and described within a single paper; it is difficult to relate individual findings to the correct animal line, but not all lines are equivalent
       ○ “Three lines of transgenic mice, M1, M2, and M3, were produced (Fig. 1B). Transgene expression was found in all tissues studied, with widespread high expression in line M1, high expression in brain of line M3, and relatively low expression in brain of line M2 (Fig. 1C).” (Ripps et al., PNAS USA, Vol. 92, pp. 689-693, January 1995)
  22. Which mouse did you use?
     • “Transgenic mice expressing SOD1G93A (12) were purchased from Jackson Laboratory”
       ○ 12 = Gurney ME; et al. 1994. Motor neuron degeneration in mice that express a human Cu,Zn superoxide dismutase mutation [see comments] [published erratum appears in Science 1995 Jul 14;269(5221):149] Science 264(5166):1772-5.
       ○ Search NIF/Jackson lab for “Gurney SOD”: 7 entries for same producer; 3 track to the same reference
     • Gogliotti et al., Biochem Biophys Res Commun. 2010 January 1; 391(1): 517.
       ○ “Here we report our findings for the SMA mouse model that has been deposited by the Li group from Taiwan. These mice, JAX stock number TJL-005058, are homozygous for the SMN2 transgene, Tg(SMN2)2Hung, and a targeted Smn allele that lacks exon 7, Smn1tm1Hung.”
  23. Minimal metadata standards (really) for publishing in the 21st century
     • 1) Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene
     • 2) Identify (i.e., give ID) the species for the subject of a study, and from which each gene product is derived, using the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebrafish and drosophila, employing any existing unique identifiers and correct species-specific nomenclature
     • 3) Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper
     • Developed by the Link Animal Model to Human Disease Initiative (LAMHDI) consortium
     • Journal of Comparative Neurology: Requires complete characterization of antibody as stated in instructions to authors
       ○ 90% of antibodies had a catalog #; 20% had a lot number after these policies were instituted
       ○ NIF could automatically identify 80% of these antibodies through matching with NIF Antibody Registry
  24. Project 2: Extracting data from tables and supplementary material
     • Challenge: Extract data on gene expression in brain from studies relevant to drug abuse
     • Workflow: Find articles → Extract results from tables → Standardize results → Load into NIF
     • Drug related gene database: 140 tables from 54 articles
     • Andrea Arnaud-Stagg, Anita Bandrowski
  25. Extracting additional knowledge from supplementary material
     • Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphine
     • Translations:
       ○ Upregulated, p < 0.05 = increased expression
       ○ LC = locus coeruleus
       ○ Probe ID = gene name
     • J Neurosci. 2005 Jun 22;25(25):6005-15.
  26. Challenges working with tables and supplemental data
     • Difficult data arrangements
       ○ PDF, JPG, TXT, CSV, XLS
       ○ Difficult styles: colors, symbols, data arrangements (results combined into one column, multiple comparisons in one table, legends defining values), unclearly described data (e.g., unclear significance)
     • Not clear what tables/values represent
       ○ Nothing in paper about the supplementary data file and table has no heading
       ○ Probe IDs are given but not gene identifiers
     • No link from supplemental material back to article; lose provenance
     • Not all results are accounted for
  27. Is SMN1 affected by drugs of abuse?
     • SMN1 is the gene that is mutated in Spinal Muscular Atrophy, a neurological disease of children
  28. Open world vs closed world assumptions
     • Closed world assumption:
       ○ holds that any statement that is not known to be true is false
       ○ allows an agent to infer, from its lack of knowledge of a statement being true, anything that follows from that statement being false
       ○ typically applies when a system has complete control over information
     • Open world assumption:
       ○ the assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true
       ○ limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true
       ○ applies when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information
  29. Reporting data: Closing the open world
     • Open world: “We measured the expression of 9000 genes as a function of chronic cocaine (S1). The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2.”
       ○ What about the other 8950 genes?
       ○ Cannot assume that they were increased, decreased or remained the same
     • Closed world: “We measured the expression of 9000 genes as a function of chronic cocaine (S1). The fold change and p value are given for each gene. The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2.”
  30. Narrative vs Data publishing
     • Narrative (Author): Encourage use of minimal standards for key entities in the research paper
       ○ Subject, protocol, genes, reagents; make it easy to find accession numbers
       ○ Standard templates for reporting supplemental data? Unlikely although desired
       ○ Tools for linking in-line references to fragments of papers rather than the entire paper
     • Data (Curators): Structuring data requires expertise
       ○ Positive and negative results equally important
       ○ If data are to be published in supplemental material or in paper, should make them machine interpretable
       ○ Ideally, entire data set should be deposited in a public repository, e.g., GEO (Gene Expression Omnibus)
  31. Conclusions
     • Humans are storytellers; it’s fundamental to the way we communicate
       ○ But these stories are directed to an audience with expertise
       ○ Scientists know each other’s work; personal networks very important
       ○ The computer isn’t part of this
     • So... we need to adapt publishing practices to aid automated search and mining of content
       ○ Partnership between authors, publishers, curators and computer scientists, informaticians...
     • Future of research communications and e-scholarship
       ○ http://force11.org
       ○ JOIN US!
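
The reporting gaps counted in slide 20 (missing species, clonality, vendor, or catalog number) and the minimal metadata standards in slide 23 suggest a simple completeness check on each reagent description. The following is a minimal Python sketch under stated assumptions: the `AntibodyRecord` layout, its field names, and the `missing_fields` helper are hypothetical illustrations, not part of NIF or the NIF Antibody Registry.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AntibodyRecord:
    """Hypothetical record of one antibody as described in a methods section."""
    target: Optional[str] = None
    host_species: Optional[str] = None
    clonality: Optional[str] = None
    vendor: Optional[str] = None
    catalog_number: Optional[str] = None


def missing_fields(record: AntibodyRecord) -> List[str]:
    """Return the names of fields the authors left unreported (None or empty)."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# A description that gives only target and vendor, as in "anti-ChAT (1:100; Millipore)".
sparse = AntibodyRecord(target="ChAT", vendor="Millipore")
print(missing_fields(sparse))
```

A record with an empty result from `missing_fields` would carry enough information to attempt a match against a registry; anything else points to the details a curator would have to chase through other papers.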

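
The difference between the open and closed world assumptions in slides 28 and 29 can be sketched as a lookup that answers differently depending on whether the reported gene list is known to be complete. This is an illustrative Python sketch only: the `was_increased` function and the gene names in the example set are made up for the illustration and do not come from any study cited in the slides.

```python
from typing import Optional, Set


def was_increased(gene: str, reported_increased: Set[str],
                  closed_world: bool) -> Optional[bool]:
    """Answer "was expression of this gene increased?" from a reported list.

    closed_world=True  : the full data set was published, so a gene absent
                         from the list is known not to be increased (False).
    closed_world=False : only the significant genes were published, so a
                         gene absent from the list is unknown (None).
    """
    if gene in reported_increased:
        return True
    return False if closed_world else None


# Hypothetical "Table 2": only the significantly increased genes are listed.
table_2 = {"GeneA", "GeneB"}

print(was_increased("GeneA", table_2, closed_world=False))
print(was_increased("SMN1", table_2, closed_world=False))
print(was_increased("SMN1", table_2, closed_world=True))
```

Under the open world setting the question from slide 27 cannot be answered from the table alone; publishing the fold change and p value for every measured gene is what licenses the closed world answer.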