Databases and Ontologies: Where do we go from here?


International Neuroinformatics Short Course on Neuroinformatics 2013; introduction to databases, ontologies and the Neuroscience Information Framework

Published in: Education, Technology

  1. Maryann E. Martone, Ph.D., University of California, San Diego. INCF Neuroinformatics Short Course, Stockholm, August 2013
  2. • Introduction
     • Introduction to the Neuroscience Information Framework
     • Structured information: data, databases
     • Federating neuroscience-relevant databases
     • Information frameworks
     • Ontologies
     • What can we do with information in the NIF?
     • Conclusions
  3. Future of research communications and e-scholarship [diagram labels: Scholar; Library; Scholar; Publisher]
  4. [Diagram labels: Scholar; Consumer; Libraries; Data Repositories; Code Repositories; Community databases/platforms; OA; Curators; Social Networks; Peer Reviewers; Narrative; Workflows; Data; Models; Multimedia; Nanopublications; Code]
  5. http://
  6. • NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
       – NIF unites neuroscience information without respect to domain, funding agency, institute or community
       – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
       – Makes them searchable from a single interface
       – Practical and cost-effective; tries to be sensible
       – Learned a lot about effective data sharing
     The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes. http://
  7. We’d like to be able to find:
     • What is known:
       – What are the projections of hippocampus?
       – Is GRM1 expressed in cerebral cortex?
       – What genes have been found to be upregulated in chronic drug abuse in adults?
       – What animal models have similar phenotypes to Parkinson’s disease?
       – What studies used my polyclonal antibody against GABA in humans?
     • What is not known:
       – Connections among data
       – Gaps in knowledge
     A framework makes it easier to address these questions
  8. Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are
     • Whole brain data (20 µm microscopic MRI)
     • Mosaic LM images (1 GB+)
     • Conventional LM images
     • Individual cell morphologies
     • EM volumes & reconstructions
     • Solved molecular structures
     No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
  9. • Data warehouse: may contain data from diverse sources; schemas are integrated. Data are “cleaned” to fit a unified data model. One database to rule them all...
     • Data federation: a virtual database that stores data definitions and not the data itself. The virtual database has information about the location of the data. When a single call is made to the virtual database, the technology issues multiple calls to the underlying databases and is also responsible for meaningfully aggregating the returned result sets.
     From Wikipedia and http:// data_federation_a_potent_subst_1.html
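The federation idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not NIF's implementation: the two in-memory SQLite sources, their table and column names, and the `federated_search` helper are all hypothetical.

```python
import sqlite3

def make_source(schema_sql, insert_sql, rows):
    """Create an independent in-memory database standing in for a remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    return conn

# Two sources with different schemas (hypothetical content).
source_a = make_source(
    "CREATE TABLE expression (gene TEXT, region TEXT)",
    "INSERT INTO expression VALUES (?, ?)",
    [("Grm1", "cerebral cortex"), ("Calb1", "cerebellum")],
)
source_b = make_source(
    "CREATE TABLE records (symbol TEXT, structure TEXT, species TEXT)",
    "INSERT INTO records VALUES (?, ?, ?)",
    [("Grm1", "hippocampus", "mouse")],
)

# The federation stores only definitions: where each source is and how its
# columns map onto the common view (gene, region). It holds no data itself.
FEDERATION = [
    ("source_a", source_a, "SELECT gene, region FROM expression WHERE gene = ?"),
    ("source_b", source_b, "SELECT symbol, structure FROM records WHERE symbol = ?"),
]

def federated_search(gene):
    """One call fans out to every source and aggregates rows into a common view."""
    results = []
    for name, conn, query in FEDERATION:
        for found_gene, region in conn.execute(query, (gene,)):
            results.append({"source": name, "gene": found_gene, "region": region})
    return results
```

A warehouse would instead copy and clean both tables into one schema up front; here the sources keep their own schemas and only the per-source query encodes the mapping.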
  10. Relational database (data model; data types; formal query language):
      Subject 473
      • Species: mouse (string)
      • Age: 50 days (integer)
      • Age category: adult
      • Protocol: 2
      Free text (entity recognition; natural language processing):
      “Mice (aged 50 days) were perfused with 4% paraformaldehyde and brains were sectioned at a thickness of 50 µm. Sections were labeled using antibodies against calbindin and imaged on a Zeiss confocal microscope.”
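The contrast between the two halves of this slide can be made concrete with a short Python sketch. The table layout and the `extract_age_days` helper are assumptions made for illustration; real entity recognition is far more involved than one regular expression.

```python
import re
import sqlite3

# Structured: typed columns and a formal query language.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject (id INTEGER, species TEXT, age_days INTEGER, protocol INTEGER)"
)
conn.execute("INSERT INTO subject VALUES (473, 'mouse', 50, 2)")
fifty_day_mice = conn.execute(
    "SELECT id FROM subject WHERE species = 'mouse' AND age_days = 50"
).fetchall()

# Unstructured: the same fact has to be recovered by pattern matching.
TEXT = (
    "Mice (aged 50 days) were perfused with 4% paraformaldehyde and brains "
    "were sectioned at a thickness of 50 um."
)

def extract_age_days(text):
    """Return the age in days if the text states it as 'aged N days', else None."""
    match = re.search(r"aged\s+(\d+)\s+days", text)
    return int(match.group(1)) if match else None
```

Note that the free text contains two different "50" values (age and section thickness); the database distinguishes them by column, while the pattern has to rely on surrounding words.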
  11. [Diagram labels: ∞; What is easily machine processable and accessible; What is potentially knowable; What is known: literature, images, human knowledge; Unstructured: natural language processing, entity recognition, image processing and analysis; paywalls; communication; Abstracts vs full text vs tables etc.]
  12. http://
      • A portal for finding and using neuroscience resources
      • A consistent framework for describing resources
      • Provides simultaneous search of multiple types of information, organized by category
      • Supported by an expansive ontology for neuroscience
      • Utilizes advanced technologies to search the “hidden web”
      Literature; Database Federation; Registry
      UCSD, Yale, Cal Tech, George Mason, Washington Univ
      (Slide footer: June 10, 2013, dkCOIN Investigator's Retreat)
  13. With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
  14. Low barrier to entry; incremental refinement
      • NIF curators
      • Nomination by the community
      • Semi-automated text mining pipelines
      • NIF Registry
      • Requires no special skills
      • Site map available for local hosting
      • NIF Data Federation
      • DISCO interop
      • Requires some programming skill
      • Open Source Brain < 2 hr
  15. NIF was designed to be populated rapidly with progressive refinement
  16. Databases come in many shapes and sizes
      • Primary data:
        – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
      • Secondary data:
        – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
      • Tertiary data:
        – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
      • Registries:
        – Metadata
        – Pointers to data sets or materials stored elsewhere
      • Data aggregators:
        – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
      • Single source:
        – Data acquired within a single context, e.g., Allen Brain Atlas
      Researchers are producing a variety of information artifacts using a multitude of technologies
  17. • Data: values of qualitative or quantitative variables, belonging to a set of items... often the results of measurements (Wikipedia)
      • Metadata: “data about data”
        – Structural metadata: the design and specification of data structures, more properly called “data about the containers of data” (Wikipedia); e.g., image size, bit depth, integer vs string
        – Descriptive metadata: individual instances of application data, the data content, “data about data content”; e.g., creator, subject
      • Data type: the form of the data for the purposes of data operations
      • Data integration: combining data residing in different sources and providing users with a unified view of these data
      “Metadata are data” (Wikipedia)
  18. NIF searches the largest collation of neuroscience-relevant data on the web [chart: number of federated databases and number of federated records (millions) over time; label: DISCO]
  19. • Long tail data: large numbers of small data sets. http://
  20. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
      • Query expansion: synonyms and related concepts
      • Boolean queries
      • Data sources categorized by “data type” and level of nervous system
      • Common views across multiple sources
      • Tutorials for using the full resource when getting there from NIF
      • Link back to record in original source
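Query expansion of the kind shown on this slide can be sketched as a lookup followed by a Boolean OR. The `SYNONYMS` table and `expand_query` function below are hypothetical stand-ins for an ontology-backed service, not NIF's actual code.

```python
# Hypothetical synonym table; in practice this would come from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}

def expand_query(term):
    """Expand a search term into a Boolean OR over the term and its synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)

expanded = expand_query("Hippocampus")
# expanded == 'Hippocampus OR "Cornu Ammonis" OR "Ammon\'s horn"'
```

Terms with no entry in the table pass through unchanged, so the expansion never narrows a query.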
  21. Connects to; Synapsed with; Synapsed by; Input region; Innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site
      Each resource implements a different, though related, model; systems are complex and difficult to learn, in many cases
  22. • Current web is designed to share documents
        – Documents are unstructured data
      • Much of the content of digital resources is part of the “hidden web”
      • Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
  23. Even Google needs a knowledge framework
  24. Knowledge in space and spatial relationships (the “where”). Knowledge in words, terminologies and logical relationships (the “what”)
  25. [Image labels: Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex]
      There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
  26. • NIF covers multiple structural scales and domains of relevance to neuroscience
      • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
      NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
  27. Brain has a Cerebellum, which has a Purkinje Cell Layer, which has a Purkinje cell, which is a neuron
      • Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
      • Branch of philosophy: a theory of what is
      • e.g., Gene ontologies
  28. • Express neuroscience concepts in a way that is machine readable
        – Synonyms, lexical variants
        – Definitions
      • Provide means of disambiguation of strings
        – Nucleus part of cell; nucleus part of brain; nucleus part of atom
      • Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
      • Properties
        – Support reasoning
      • Provide universals for navigating across different data sources
        – Semantic “index”
        – Link data through relationships, not just one-to-one mappings
      • Provide the basis for concept-based queries to probe and mine data
      • Establish a semantic framework for landscape analysis
      Mathematics, computer code or Esperanto
  29. Aligns sources to the NIF semantic framework
  30. birnlex_1741: Brodmann.10
      Explicit mapping of database content helps disambiguate non-unique and custom terminology
  31. birnlex_1204: CA3
  32. • Search Google: GABAergic neuron
      • Search NIF: GABAergic neuron
        – NIF automatically searches for types of GABAergic neurons
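Searching for "types of" a concept amounts to walking the is-a hierarchy downward before running the query. A minimal Python sketch follows; the `IS_A` table is a toy hierarchy assumed for illustration and is not taken from NIFSTD.

```python
# Toy is-a hierarchy (child -> parent); assumed for illustration only.
IS_A = {
    "GABAergic neuron": "neuron",
    "Purkinje cell": "GABAergic neuron",
    "basket cell": "GABAergic neuron",
    "cerebellar basket cell": "basket cell",
    "pyramidal cell": "neuron",
}

def subclasses(term):
    """Return every direct and indirect subclass of a term."""
    found = set()
    frontier = [term]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in IS_A.items():
            if child_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def search_terms(term):
    """The term itself plus all of its subclasses, ready to be OR-ed together."""
    return {term} | subclasses(term)
```

A plain string search for "GABAergic neuron" would miss a record labeled only "Purkinje cell"; the expanded term set catches it.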
  33. Equivalence classes; restrictions. Arbitrary but defensible
      • Neurons classified by:
        – Circuit role: principal neuron vs interneuron
        – Molecular constituent: parvalbumin neurons, calbindin neurons
        – Brain region: cerebellar neuron
        – Morphology: spiny neuron
      • Molecule roles: drug of abuse, anterograde tracer, retrograde tracer
      • Brain parts: circumventricular organ
      • Organisms: non-human primate, non-human vertebrate
      • Qualities: expression level
      • Techniques: neuroimaging
  34. What genes are upregulated by drugs of abuse in the adult mouse? (Show me the data!)
      Query terms: Morphine; Increased expression; Adult Mouse
  35. • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
        – Brain Architecture Management System (rodent)
        – Temporal (rodent)
        – Connectome Wiki (human)
        – Brain Maps (various)
        – CoCoMac (primate cortex)
        – UCLA Multimodal database (human fMRI)
        – Avian Brain Connectivity Database (bird)
      • Total: 1800 unique brain terms (excluding Avian)
      • Number of exact terms used in > 1 database: 42
      • Number of synonym matches: 99
      • Number of 1st order partonomy matches: 385
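The counts on this slide come from comparing term lists across sources. The following Python sketch shows how exact matches and synonym matches could be tallied; the database names, term lists and synonym table are hypothetical, and the real analysis is not described in the slides.

```python
from collections import Counter

# Hypothetical term lists standing in for the vocabularies of real databases.
DATABASES = {
    "db_one": {"hippocampus", "CA3", "striatum"},
    "db_two": {"Ammon's horn", "CA3"},
    "db_three": {"neostriatum"},
}
# Hypothetical synonym table mapping variants to one preferred term.
SYNONYMS = {"Ammon's horn": "hippocampus", "neostriatum": "striatum"}

def terms_in_multiple(databases, synonyms=None):
    """Return terms that occur in more than one database.

    With a synonym table, each term is first mapped to its preferred form,
    so matches that differ only in naming are also found.
    """
    synonyms = synonyms or {}
    counts = Counter()
    for terms in databases.values():
        counts.update({synonyms.get(t, t) for t in terms})
    return {t for t, n in counts.items() if n > 1}

exact_matches = terms_in_multiple(DATABASES)
synonym_matches = terms_in_multiple(DATABASES, SYNONYMS) - exact_matches
```

In this toy data only one term matches exactly, and two more are recovered through synonyms, which mirrors the slide's point that exact string overlap between sources is small.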
  36. Many schools of thought about ontologies, their construction and use
      • Realism vs conceptualism
      • Controlled vocabularies vs taxonomies vs ontology?
      • How do I name classes?
      • Shared vs custom ontologies
      • Single vs multiple inheritance
      • RDF vs OWL?
      • Top down vs bottom up: heavyweight vs lightweight ontologies
      • Should I encode everything in my ontology?
  37. • Controlled vocabularies: prescribed list of terms or headings, each one having an assigned meaning
      • Lexicon/Thesaurus: vocabularies + their lexical properties, e.g., synonyms, lexical variants
      • Taxonomy: monohierarchical classification of concepts, as used, for example, in the classification of biological organisms, built on the “is a” relationship
      • Ontology: specification of the concepts of a domain and their relationships, structured to allow computer processing and reasoning
      http:// (Mike Bergman)
  38. • Identity:
        – Entities are uniquely identifiable
        – Name is a meaningless numerical identifier (URI: Uniform Resource Identifier)
        – Any number of human-readable labels can be assigned to it
      • Definition:
        – Genera: is a type of (cell, anatomical structure, cell part)
        – Differentia: “has a”; a set of properties that distinguish among members of that class
        – Can include necessary and sufficient conditions
      • Implementation: how is this definition expressed?
        – Depending on the nature of the concept or entity and the needs of the information system, we can say more or fewer things
        – Different languages can express different things about the concept that can be computed upon
          • OWL (W3C standard), RDF
      Example: birnlex_1362 (CA2) vs CHEBI_29108 (CA2)
      NIF follows OBO Foundry best practices for naming and defining classes
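The two "CA2" identifiers on this slide show why identity should rest on an opaque identifier rather than a label. A minimal Python sketch: the identifiers and the shared label come from the slide, while the `genus` values are an assumption inferred from the identifier prefixes and should be checked against the source ontologies.

```python
# Identifiers and labels as shown on the slide. The "genus" values are an
# assumption made for illustration, inferred from the identifier prefixes.
CLASSES = {
    "birnlex_1362": {"labels": ["CA2"], "genus": "anatomical structure"},
    "CHEBI_29108": {"labels": ["CA2"], "genus": "molecule"},
}

def ids_for_label(label):
    """All identifiers that carry a given human-readable label."""
    return sorted(cid for cid, c in CLASSES.items() if label in c["labels"])

def resolve(label, genus):
    """Disambiguate a label by the kind of thing it is meant to name."""
    return [cid for cid in ids_for_label(label) if CLASSES[cid]["genus"] == genus]
```

Looking up the bare label returns both classes; adding the genus narrows the result to one, which is the disambiguation step that a string match alone cannot perform.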
  39. • XML (Extensible Markup Language): markup language for data. XML itself is not very much concerned with meaning. XML nodes don't need to be associated with particular concepts, and the XML standard doesn't indicate how to derive a fact from a document.
      • RDF (Resource Description Framework): a general method to decompose knowledge into small pieces, with some rules about the semantics, or meaning, of those pieces. What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning.
        – Small pieces are called “triples”: subject, predicate, object
        – Purkinje neuron (S) has neurotransmitter (P) GABA (O)
      • RDFS: a method of specifying metadata about properties/characteristics of things and classes of things such that inference can be carried out (conceptualized in RDF)
      • OWL (Web Ontology Language): a more complex (and more powerful) extension of RDFS
      • SPARQL: a query language designed for RDF (similar to how SQL was designed for relational databases)
      http://answers.semanti…tions/15215/whats-the-difference-between-using-rdfsowl-versus-xml
      http://
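The triple model and the idea behind a SPARQL pattern query can be illustrated without any RDF library. In this Python sketch the first triple is the slide's example; the other triples and the `match` helper are added for illustration and use plain strings rather than URIs.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("Purkinje neuron", "has neurotransmitter", "GABA"),       # from the slide
    ("Purkinje neuron", "is a", "neuron"),                     # illustrative
    ("granule cell", "has neurotransmitter", "glutamate"),     # illustrative
]

def match(subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard.

    This mimics the core of a SPARQL basic graph pattern with one triple,
    e.g. SELECT ?s WHERE { ?s <has neurotransmitter> <GABA> }.
    """
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

gaba_cells = [s for (s, _, _) in match(predicate="has neurotransmitter", obj="GABA")]
```

The matcher never interprets the predicate string; it only compares it, which is the point made about RDF tools being ignorant of what the names mean.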
  40. Relational model
      • Mouse has age 50 days
      • Protocol uses instrument confocal microscope
      • A confocal imaging protocol is a protocol that uses instrument confocal microscope
      May link to other information, e.g., mouse is a rodent
      RDF: The computer doesn't need to know what “has” actually means in English for this to be useful. It is left up to the application writer to choose appropriate names for things (confocal microscope) and to use the right predicates (uses, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. http:// #Introducing%20RDF
  41. The thalamus projects to the cortex in mammals
      • Universal (owl:allValuesFrom): if a mammal has a cortex and a thalamus, then the thalamus must project to the cortex
      • Existential (owl:someValuesFrom): the thalamus projects to the cortex in at least one member of the class mammal
      • Disjointness (owl:disjointWith): a member of one class cannot simultaneously be an instance of a specified other class, e.g., reptiles are disjoint from mammals
      W3C OWL guide: …-owl-guide-20040210/
      Restrictions placed on classes allow us to reason over the ontology and check for consistency
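A consistency check based on disjointness can be sketched in Python as follows. This is a toy check over explicit class assertions, assumed for illustration; an OWL reasoner also handles inferred class membership and the universal and existential restrictions, which this sketch does not attempt.

```python
# Declared disjoint class pairs (the slide's example).
DISJOINT_PAIRS = [("reptile", "mammal")]

def violations(individuals):
    """Report individuals asserted to belong to two disjoint classes.

    `individuals` maps an individual's name to the set of classes it is
    asserted to be an instance of.
    """
    problems = []
    for name, classes in individuals.items():
        for a, b in DISJOINT_PAIRS:
            if a in classes and b in classes:
                problems.append((name, a, b))
    return problems

# Hypothetical annotations, one of which is inconsistent.
annotations = {
    "subject_1": {"mammal", "rodent"},
    "subject_2": {"mammal", "reptile"},
}
```

Running `violations(annotations)` flags only `subject_2`, the kind of curation error that a disjointness axiom is meant to expose.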
  43. 1. Look brain region up in NeuroLex
      2. Look up cells contained in the brain region
      3. Find those cells that are known to project out of that brain region
      4. Look up the neurotransmitters for those cells
      5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
      6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made
      7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong
      Stephen Larson
      CHEBI:18243
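Steps 1 to 6 above can be sketched as a chain of lookups. In this Python sketch the four dictionaries are small hand-written stand-ins for NeuroLex content, and `classify_projection` is a hypothetical name; the real workflow queries the wiki and links each statement back to its page.

```python
# Hand-written stand-ins for knowledge-base content (illustrative only).
CELLS_IN_REGION = {"cerebellar cortex": ["Purkinje cell", "granule cell"]}
PROJECTS_OUT_OF_REGION = {"Purkinje cell": True, "granule cell": False}
NEUROTRANSMITTER = {"Purkinje cell": "GABA", "granule cell": "glutamate"}
TRANSMITTER_EFFECT = {"GABA": "inhibitory", "glutamate": "excitatory"}

def classify_projection(region):
    """Follow region -> cells -> projecting cells -> transmitter -> effect.

    Returns one record per projecting cell type, keeping every link in the
    chain so that each statement can be inspected and corrected.
    """
    chain = []
    for cell in CELLS_IN_REGION.get(region, []):           # steps 1-2
        if not PROJECTS_OUT_OF_REGION.get(cell, False):     # step 3
            continue
        transmitter = NEUROTRANSMITTER[cell]                # step 4
        effect = TRANSMITTER_EFFECT[transmitter]            # step 5
        chain.append(                                       # step 6
            {
                "region": region,
                "cell": cell,
                "neurotransmitter": transmitter,
                "effect": effect,
            }
        )
    return chain
```

Returning the whole chain rather than just the final label is what makes step 7 possible: a user who disagrees with the result can see which individual statement to edit.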
  44. Reuse identifiers rather than recreate them
      [Diagram labels: Gross anatomy ontology; Cell centered anatomy ontology; Brain; Cerebellum; Cortex; Cerebellar Purkinje cell; Purkinje neuron; Purkinje cell soma; Purkinje cell layer; Cerebellar cortex; IP3; Cerebellum]
      • To create the linkages requires mapping
      • Mapping is usually incomplete and not always possible
      • Can’t take advantage of others’ work
  45. • “The trouble is that if I make up all of my own URIs, my RDF document has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two RDF documents with no URIs in common have no information that can be interrelated.”
      • NIF favors reuse of identifiers rather than mapping
      • Creating ontologies to be used as common building blocks (modularity, low semantic overhead) is important
      http://
  46. [Diagram: modules linked by “has part” and “is part of” relations]
      • Cell part ontology: Cerebellum Purkinje cell soma; Cerebellum Purkinje cell dendrite; Cerebellum Purkinje cell axon
      • Anatomy ontology: Cerebellum granule cell layer; Cerebellum Purkinje cell layer; Cerebellum molecular layer; Cerebellar cortex
      • Cell Ontology: Cerebellum Purkinje neuron
      • Molecules: Calbindin; IP3 (CHEBI:16595)
  47. You can enhance your tools and annotation with community ontologies
      • Neuroscience Information Framework
        – NIFSTD available for download
        – Ontoquest web services
        – NIF annotation services and mapping tools available
        – Neurolex available via SPARQL endpoint
      • Bioportal: collection of > 300 ontologies covering many domains
        – Automated mapping between ontologies
        – Annotation services
        – Web services for access
      • OBO Foundry: http://
        – Collection of community ontologies designed according to OBO Foundry principles
      • Protégé Ontology editor: editing tool for constructing ontologies. Excellent short course available for Protégé/OWL.
      • Program on Ontologies of Neural Structures (INCF): CUMBO, Neurolex Wiki, Scalable Brain Atlas
  48. http://
      Larson et al., Frontiers in Neuroinformatics, in press
      • Semantic MediaWiki
        – Provides a simple interface for defining the concepts required
        – Lightweight semantics
        – Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
      • Community based:
        – Anyone can contribute their terms, concepts, things
        – Anyone can edit
        – Anyone can link
      • Accessible: searched by Google
      • Growing into a significant knowledge base for neuroscience
      Demo D03; 200,000 edits; 150 contributors
  49. Red links: information is missing (or misspelled)
  50. • Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
      • INCF task forces are contributing knowledge
      • Neuroscience knowledge in the web
      Builds a knowledge base by cross-modular relations and links to data
  51. Once terms have been proposed and vetted by the neuroscience community, NIF feeds them back to general ontologies to enrich coverage of neuroscience
  52. Because their pages have static URLs, wikis are searchable by Google
  53. • INCF Project
        – Neuron Registry
        – > 30 experts worldwide
        – Fill out neuron pages in Neurolex Wiki
        – Led by Dr. Gordon Shepherd
      [Chart: number of total redlinks, easy fixes and hard fixes for soma location, dendrite location and axon location]
      Social networks and community sites let us learn things from the collective behavior of contributors. INCF Knowledge Space
  54. • Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
        – Organism
        – Anatomical structure
        – Cell
        – Molecule
        – Function
        – Dysfunction
        – Technique
      • 30-50% of NIF’s queries autocomplete
      • When NIF combines multiple sources, a set of common fields emerges
        – Basic information models/semantic models exist for certain types of entities
      Biomedical science does have a conceptual framework, but we don’t place undue importance on it; it must tie to data
  55. • NIF can be used to survey the data landscape
      • Analysis of NIF shows multiple databases with similar scope and content
      • Many contain partially overlapping data
      • Data “flows” from one resource to the next
        – Data is reinterpreted, reanalyzed or added to
      • Is duplication good or bad?
      NIF is trying to make it easier to work with diverse data
  56. NIF is in a unique position to answer questions about the neuroscience landscape
      Where are the data? [Chart: brain region (Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain) by data source]
  57. [Diagram labels: ∞; What is easily machine processable and accessible; What is potentially knowable; What is known: literature, images, human knowledge; Unstructured: natural language processing, entity recognition, image processing and analysis; communication; “Known unknowns vs unknown unknowns”]
      Open world meets closed world
  58. Comprehensive and unbiased?
      We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
      But... NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
  59. Properties in Neurolex [chart groups: Neocortex; Olfactory bulb; Neostriatum; Cochlear nucleus]
      All neurons with cell bodies in the same brain region are grouped together
  60. NIF is in a unique position to answer questions about the neuroscience landscape
      Where are the data? [Chart: brain region (Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain) by data source, with funding]
  61. • Requires account in MyNIF
      • Still a work in progress, i.e., it breaks a lot
      • If you are interested, contact us!
      Vadim Astakhov, Kepler Workflow Engine
  62. • Gemma: Gene ID + Gene Symbol
      • DRG: Gene name + Probe ID
      • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
      • Analysis:
        – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
        – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
        – Results for 1 gene were opposite in DRG and Gemma
        – 45 did not have enough information provided in the paper to make a judgment
      Relatively simple standards would make life easier
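The baseline mismatch described on this slide is the kind of problem a small normalization step can handle. In this Python sketch, the function names and the "up"/"down" encoding are assumptions for illustration; the slide does not describe how the comparison was actually carried out.

```python
FLIP = {"up": "down", "down": "up"}

def relative_to_saline(direction, baseline):
    """Re-express a direction of change with saline as the common baseline.

    A change reported against a chronic-morphine baseline has the opposite
    sign when expressed against saline.
    """
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline}")

def consistent(gemma_direction, drg_direction):
    """Compare one Gemma statement and one DRG statement for the same gene.

    Assumes, as on the slide, that Gemma reports relative to chronic
    morphine and DRG reports relative to saline.
    """
    return (
        relative_to_saline(gemma_direction, "chronic morphine")
        == relative_to_saline(drg_direction, "saline")
    )
```

Without the flip, every agreeing pair of statements would look contradictory, which is why recording the baseline as explicit metadata matters.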
  63. 47 of 53 major preclinical published cancer studies could not be replicated
      • “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
      • Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data
      Begley and Ellis, Nature, vol. 483, p. 531, 29 March 2012
  64. NIF favors a hybrid, tiered, federated system
      • Domain knowledge
        – Ontologies
      • Claims, models and observations
        – Virtuoso RDF triples
        – Model repositories
      • Data
        – Data federation
        – Spatial data
        – Workflows
      • Narrative
        – Full text access
      Entity types: Neuron; Brain part; Disease; Organism; Gene; Technique; People
      Example claims: Caudate projects to SNpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS
      NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
  65. • Several powerful trends should change the way we think about our data: one → many
        – Many data
          • Generation of data is getting easier → shared data
          • Data space is getting richer: more -omes every day
          • But... compared to the biological space, still sparse
        – Many eyes
          • Wisdom of crowds
          • More than one way to interpret data
        – Many algorithms
          • Not a single way to analyze data
        – Many analytics
          • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
      Are you exposing or burying your work?
  66. • You (and the machine) have to be able to find it
        – Accessible through the web
        – Structured or semi-structured
        – Annotations
      • You (and the machine) have to be able to use it
        – Data type specified and in an actionable form
      • You (and the machine) have to know what the data mean
        – Semantics
        – Context: experimental metadata
        – Provenance: where did they come from
      Reporting neuroscience data within a consistent framework helps enormously, but the frameworks need not be onerous
  67. A data sharing snafu in 3 acts
  68. http://
  69. Jeff Grethe, UCSD, Co-Investigator, Interim PI; Amarnath Gupta, UCSD, Co-Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Svetlana Sulima; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer
      And my colleagues in Monarch, dkNet, 3DVC, Force 11