Biomedical Annotation

Kevin Livingston, Ph.D.
Postdoctoral Fellow
Pharmacology Department, School of Medicine
University of Colorado Anschutz Medical Campus




                                        Kevin.Livingston@ucdenver.edu
                      http://compbio.ucdenver.edu/Hunter_lab/Livingston
Biomedical researchers are interested in
understanding their data in the context of
   all known background knowledge:
     curated databases & literature.




                                             2
PubMed Growth Rate

[Figure: new PubMed entries per year (left axis, thousands, 0 to 1100) and total entries (right axis, millions, 0 to 25), plotted for 1987 to 2011. Two exponential fits are shown: y = ~e^(0.0405x), R² = 0.99, and y = ~e^(0.0402x), R² = 0.94.]

973,499 PubMed entries in 2011 (>2,600 per day): 2 journal articles per minute!

                                                                                                                                                                              3
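The headline rates on the growth slide can be checked with a few lines of arithmetic. The sketch below uses only the two numbers printed on the slide; it assumes that x in the fitted curve is measured in years, which the slide does not state, so the doubling time is illustrative only.

```python
import math

ENTRIES_2011 = 973_499     # new PubMed entries in 2011, from the slide
GROWTH_EXPONENT = 0.0405   # exponent of the fitted curve; x assumed to be in years

# Average rate of new entries over a 365-day year.
per_day = ENTRIES_2011 / 365
per_minute = per_day / (24 * 60)

# For y ~ e^(k*x), the quantity doubles every ln(2)/k units of x.
doubling_years = math.log(2) / GROWTH_EXPONENT

print(f"{per_day:.0f} entries per day")
print(f"{per_minute:.2f} entries per minute")
print(f"doubling time: {doubling_years:.1f} years (if x is in years)")
```

The per-day figure agrees with the slide's ">2,600 per day", and the per-minute figure rounds to the slide's "2 journal articles per minute".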
Biomedical Data Sources
• 1,380 databases in 2012
• Total GO annotations: 132,425,702
• Total manual GO annotations: 1,116,848
• PubMed articles referenced: 94,518

                                    4
Annotation Consumers?
• The linguistic community typically uses
  annotation as training data or for specific tasks
  – An abundance of tools that can produce annotations
    in the specific format of those resources
  – Tools for computational linguistics
• Biomedical annotation is typically used for
  curating, indexing, or enrichment analysis



• But what about re-using annotations and tools in
  other contexts and for other purposes?
                                                         5
6
Vision

[Diagram components: DBs, Ontologies, Texts, Text Mining, Knowledge Base, Intelligent Applications]

                                         7
Applications: Gene Centric




                             8
Applications: Document Centric




                         9
Annotation for Computation
• Computer understandable
• Composable
• Provenance of compositions traceable




                                         10
CRAFT: Colorado Richly Annotated Full Text corpus
http://bionlp-corpora.sourceforge.net/CRAFT/

•   67 full text articles (+30 more reserved for future testing)
•   >560,000 Tokens
•   >21,000 Sentences

•   ~100,000 concept annotations to
    7 different biomedical ontologies/terminologies
•   Penn Treebank markup for each sentence

•   Multiple output formats available
                                                                   11
CRAFT Annotation

Example sentence:
Hematopoiesis is precisely orchestrated by lineage-specific DNA-binding proteins that regulate transcription in concert with coactivators and corepressors.

[Figure: the sentence annotated with concepts and relations.
 Concept labels: hemopoiesis, DNA, binding, protein, transcription, biological regulation, transcription coactivator activity, transcription corepressor activity.
 Relation labels: has agent, results in interaction of, results in regulation by, entity that has function, regulates.
 Legend: CHEBI, GO BP, GO MF, SO, relation.]
                                                                                 12
Applications: Annotating




                           13
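Compositional Annotation & Knowledge

[Diagram, reconstructed from its labels:]
• text annotation 1: hasTarget CRAFT PMID:14737183; hasBody TAXON:7742 (Vertebrata)
• text annotation 2: hasTarget CRAFT PMID:14737183; hasBody GO:0043474 (pigmentation)
• text annotation 3: basedOn text annotation 1 and text annotation 2; denotes "vertebrate pigmentation"
• vertebrate pigmentation: subClassOf GO:0043474; occurs_in TAXON:7742
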
                                                                          14
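The compositional annotation slide can be read as a small set of subject/predicate/object triples. The sketch below transcribes the diagram's labels into plain Python tuples and follows basedOn links to recover which annotations, and which ontology terms, a composed annotation rests on. The "ex:" identifiers and the helper functions are hypothetical names introduced for illustration; this is not the presenter's implementation, and a real system would more likely use an RDF store.

```python
# Triples transcribed from the diagram's labels.
# The "ex:" identifiers are hypothetical stand-ins for the unnamed nodes.
TRIPLES = [
    ("ex:annotation1", "hasBody", "TAXON:7742"),
    ("ex:annotation1", "hasTarget", "PMID:14737183"),
    ("ex:annotation2", "hasBody", "GO:0043474"),
    ("ex:annotation2", "hasTarget", "PMID:14737183"),
    ("ex:annotation3", "denotes", "ex:vertebrate_pigmentation"),
    ("ex:annotation3", "basedOn", "ex:annotation1"),
    ("ex:annotation3", "basedOn", "ex:annotation2"),
    ("ex:vertebrate_pigmentation", "subClassOf", "GO:0043474"),
    ("ex:vertebrate_pigmentation", "occurs_in", "TAXON:7742"),
]


def objects(subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def provenance(annotation):
    """Return every annotation reachable through basedOn links, sorted."""
    found = []
    stack = list(objects(annotation, "basedOn"))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            # Follow chains of compositions, not just direct sources.
            stack.extend(objects(current, "basedOn"))
    return sorted(found)


def supporting_bodies(annotation):
    """Return the ontology terms asserted by the source annotations."""
    return sorted(
        body
        for source in provenance(annotation)
        for body in objects(source, "hasBody")
    )


print(provenance("ex:annotation3"))
print(supporting_bodies("ex:annotation3"))
```

Because basedOn is recorded per annotation, the composed concept can be traced back to the individual text annotations that justify it.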
Summary
• Model that covers syntactic and semantic
  annotation
  – Linguistic annotation
  – Semantic annotation
  – Entity-based annotation
• Capture complex content that is not necessarily
  best represented via a single URI
  – Created a GraphAnnotation
    that denotes an RDF named graph
• Add kiao:basedOn to enable annotation
  compositions and provenance tracking
  – Annotation-level
                                                    15
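One way to picture a GraphAnnotation is as an annotation whose content is the name of a graph rather than a single URI. The sketch below is a minimal illustration under that reading: the class fields, the "ex:" identifiers, and the dictionary standing in for a named-graph store are all assumptions, not the model's actual schema.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]

# Stand-in for a store of named graphs: graph name -> set of triples.
# All identifiers here are hypothetical.
NAMED_GRAPHS: dict[str, set[Triple]] = {
    "ex:graph1": {
        ("ex:vertebrate_pigmentation", "subClassOf", "GO:0043474"),
        ("ex:vertebrate_pigmentation", "occurs_in", "TAXON:7742"),
    },
}


@dataclass(frozen=True)
class GraphAnnotation:
    identifier: str
    denotes: str                     # name of the graph holding the content
    based_on: tuple[str, ...] = ()   # annotations this one was composed from


def content(annotation: GraphAnnotation) -> set[Triple]:
    """Look up the triples in the named graph the annotation denotes."""
    return NAMED_GRAPHS.get(annotation.denotes, set())


annotation = GraphAnnotation(
    identifier="ex:annotation3",
    denotes="ex:graph1",
    based_on=("ex:annotation1", "ex:annotation2"),
)
for triple in sorted(content(annotation)):
    print(triple)
```

Keeping the content in a separate graph lets one annotation carry several statements at once while the based_on field records annotation-level provenance.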
Acknowledgements
University of Colorado:
• Hunter Lab
   – Larry Hunter
   – Mike Bada
   – Bill Baumgartner
   – Chris Roeder
   – Kevin Cohen
   – Carsten Goerg

• National ICT Australia
   – Karin Verspoor

• Funding:
   – NIH/NLM training grant
   – Andrew W. Mellon Foundation




                                                              16


Editor's Notes

  • #5 Entity Centric
  • #12 Document Centric
  • #15 Rectangles are concepts we create; rounded rectangles are current ontological concepts. Orange objects are information content entities; blue objects are biomedical concepts.