Neurosciences Information Framework (NIF): An example of community Cyberinfrastructure for the Neurosciences

Maryann Martone
Earth Cube Summer Institute, San Diego Supercomputer Center
August 12, 2013



Usage Rights

CC Attribution License

Presentation Transcript

  • Maryann E. Martone, Ph.D., University of California, San Diego
  • “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
    However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
    Akil et al., Science, Feb 11, 2011
  • • In that same issue of Science
        – Asked peer reviewers from last year about the availability and use of data
            • About half of those polled store their data only in their laboratories, not an ideal long-term solution.
            • Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
            • And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.
    “[...] is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial, 2011
  • Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
    • Whole brain data (20 µm microscopic MRI)
    • Mosaic LM images (1 GB+)
    • Conventional LM images
    • Individual cell morphologies
    • EM volumes & reconstructions
    • Solved molecular structures
    No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
  • http://
  • • Current web is designed to share documents
        – Documents are unstructured data
    • Much of the content of digital resources is part of the “hidden web”
    • Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
  • • NIF has developed a production technology platform for researchers to:
        – Discover
        – Share
        – Analyze
        – Integrate
      neuroscience-relevant information
    • Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
    • Cost-effective and innovative strategy for managing data assets
    “This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online
    NIF is poised to capitalize on the new tools and emphasis on big data and open science
  • http://
    June 10, 2013, dkCOIN Investigator's Retreat
    • A portal for finding and using neuroscience resources
    • A consistent framework for describing resources
    • Provides simultaneous search of multiple types of information, organized by category
    • Supported by an expansive ontology for neuroscience
    • Utilizes advanced technologies to search the “hidden web”
    UCSD, Yale, Cal Tech, George Mason, Washington Univ
    Literature; Database Federation; Registry
  • • NIF Registry: A catalog of neuroscience-relevant resources
    • > 6000 currently listed
    • > 2200 databases
    • And we are finding more every day
    “Of relevance to neuroscience” is very broad
  • • NIF curators
    • Nomination by the community
    • Semi-automated text mining pipelines
    • NIF Registry
        – Requires no special skills
        – Site map available for local hosting
    • NIF Data Federation
        – DISCO interop
        – Requires some programming skill
    Low barrier to entry
  • First catalog: SFN Neuroscience Database Gateway (~2003), then NIF 0.5 (2006), then NIF 1.0+ (2008)
    Simple metadata model: name, description, type, URL, other names, keywords, unique identifier
    • Extended over time
        – Parent resource
        – Supporting agency
        – Grant numbers
        – Accessibility
        – Related to
        – Organism
        – Disease or condition
        – Last updated
  • • NIF Registry is hosted on the Semantic MediaWiki platform, Neurolex
        – Community can add, review, edit without special privileges
        – Searchable by Google
        – Integrated with NIF ontologies
        – Graph structure
    Semantic wiki: A wiki with semantics; pages are linked through relationships
  • NIF is creating the linked data graph of resources
  • – NIF employs an automated link checker
    – Last analysis: 478/6100 invalid URLs (~8%)
    – 199 could not be located at another university or location and are out of service (~3%)
    – Bigger issue: Many resources are no longer updated or maintained
    [Chart: resources added and last updated, by year, 1996 to 2014]
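The percentages on this slide can be checked with a few lines of arithmetic. The sketch below only recomputes the figures quoted above; treating the remaining invalid URLs as resources that moved elsewhere is an interpretation of the slide, not something it states outright.

```python
# Counts quoted on the slide.
TOTAL_RESOURCES = 6100
INVALID_URLS = 478
OUT_OF_SERVICE = 199


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


invalid_pct = percent(INVALID_URLS, TOTAL_RESOURCES)
out_of_service_pct = percent(OUT_OF_SERVICE, TOTAL_RESOURCES)

# Assumption: invalid URLs that are not out of service were found at a new address.
presumed_relocated = INVALID_URLS - OUT_OF_SERVICE

print(f"invalid URLs: {invalid_pct}%")
print(f"out of service: {out_of_service_pct}%")
print(f"presumed relocated: {presumed_relocated}")
```

Both values agree with the rounded figures on the slide (~8% and ~3%).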
  • Keeping content up to date
    Connectome; Tractography; Epigenetics
    • New tags come into existence
    • New resource types come into existence, e.g., mobile apps
    • Resources add new types of content
    • Change name
    • Change scope
    • > 7000 updates to the registry last year
    It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
  • • The NIF Registry has created a linked data graph of web-accessible resources
    • Maintained on a community wiki platform
    • Provides data on the fluidity of the resource landscape
        – New resources continue to be created and found
        – Relatively few disappear altogether
        – Many more grow stale, although their value may still be significant
        – Maintaining up-to-date curation requires frequent updating
    NIF Registry provides insight into the state of digital resources on the web
  • • The NIF data federation performs deep search over the content of over 200 databases
    • New databases are added at a rate of 25-40 per year
    • Latest update: Open Source Brain; ingest completed in 2 hours
    • Databases chosen on a variety of criteria:
        • Early: testing different types of resources
        • Thematic areas
        • Volunteers
    NIF provides access to the largest aggregation of neuroscience-relevant information on the web
  • • NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
    • NIF is supported by a contract that specified the number of resources to be added per year
        – Designed to be populated rapidly; set up process for progressive refinement
        – No budget was allocated to retrofit existing resources; had to work with them in their current state
        – We designed a system that required little to no cooperation or work from providers
        – Supports many formats: relational, XML, RDF
  • DISCO Dashboard Functions (current and planned)
    • Ingest Script Manager
    • Public Script Repository
    • Data & Event Tracker
    • Versioning System
    • Curator Tool
    • Data Transformer Manager
    Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
  • NIF searches the largest collation of neuroscience-relevant data on the web
    [Chart: number of federated databases and number of federated records (millions) over time; DISCO]
  • Results categorized by data type and level of nervous system
  • Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
    • Query expansion: synonyms and related concepts
    • Boolean queries
    • Data sources categorized by “data type” and level of nervous system
    • Common views across multiple sources
    • Tutorials for using full resource when getting there from NIF
    • Link back to record in original source
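The query expansion on this slide can be pictured as a synonym lookup followed by an OR join. The following is a minimal sketch under stated assumptions: the `SYNONYMS` table and the `expand_query` function are hypothetical and only reproduce the slide's hippocampus example; they are not NIF's actual implementation, which draws its synonyms from the NIF ontologies.

```python
# Hypothetical synonym table; the single entry mirrors the slide's example.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: dict = SYNONYMS) -> str:
    """Expand a search term into a Boolean OR query over its known synonyms."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


print(expand_query("Hippocampus"))
# A term with no known synonyms passes through unchanged.
print(expand_query("Striatum"))
```

The first call prints `Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"`, the expanded query shown on the slide (with straight rather than curly quotes).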
  • • Connects to
    • Synapsed with
    • Synapsed by
    • Input region
    • innervates
    • Axon innervates
    • Projects to
    • Cellular contact
    • Subcellular contact
    • Source site
    • Target site
    Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
  • • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
        • Brain Architecture Management System (rodent)
        • Temporal (rodent)
        • Connectome Wiki (human)
        • Brain Maps (various)
        • CoCoMac (primate cortex)
        • UCLA Multimodal database (Human fMRI)
        • Avian Brain Connectivity Database (Bird)
    • Total: 1800 unique brain terms (excluding Avian)
    • Number of exact terms used in > 1 database: 42
    • Number of synonym matches: 99
    • Number of 1st order partonomy matches: 385
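The three kinds of match counted on this slide (exact term, synonym, first-order partonomy) can be illustrated with a small classifier. This is a sketch only: the lookup tables hold a handful of toy entries chosen for illustration, the function names are hypothetical, and case-insensitive comparison is an assumption about what "exact" means here.

```python
from typing import Optional

# Toy lookup tables (hypothetical; a real system would draw these from an ontology).
SYNONYM_OF = {  # synonym -> preferred label
    "ammon's horn": "hippocampus",
    "cornu ammonis": "hippocampus",
}
PART_OF = {  # structure -> its direct parent structure
    "dentate gyrus": "hippocampal formation",
    "hippocampus": "hippocampal formation",
}


def normalize(term: str) -> str:
    """Lower-case and trim a term so comparisons ignore case and padding."""
    return term.strip().lower()


def preferred(term: str) -> str:
    """Map a term to its preferred label, or return it normalized if none is known."""
    t = normalize(term)
    return SYNONYM_OF.get(t, t)


def match_type(a: str, b: str) -> Optional[str]:
    """Classify how two brain-region terms from different databases relate."""
    if normalize(a) == normalize(b):
        return "exact"
    pa, pb = preferred(a), preferred(b)
    if pa == pb:
        return "synonym"
    # First-order partonomy: one term is a direct part of the other.
    if PART_OF.get(pa) == pb or PART_OF.get(pb) == pa:
        return "partonomy"
    return None


print(match_type("Hippocampus", "hippocampus"))
print(match_type("Ammon's horn", "Hippocampus"))
print(match_type("Dentate gyrus", "Hippocampal formation"))
print(match_type("Striatum", "Hippocampus"))
```

The counts on the slide suggest why the ordering matters: of 1800 terms, only 42 match exactly, while the synonym and partonomy checks recover several hundred further correspondences.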
  • – You (and the machine) have to be able to find it
        • Accessible through the web
        • Annotations
    – You have to be able to access and use it
        • Data type specified and in a usable form
    – You have to know what the data mean
        • Some semantics: “1”
        • Context: Experimental metadata
        • Provenance: Where did the data come from?
    Reporting neuroscience data within a consistent framework helps enormously
  • Knowledge in space and spatial relationships (the “where”)
    Knowledge in words, terminologies and logical relationships (the “what”)
  • • NIF covers multiple structural scales and domains of relevance to neuroscience
    • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
    NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
    NIF capitalizes on the growing set of community ontologies available in biomedical science
  • Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex
    There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
  • Brain (has a) Cerebellum (has a) Purkinje Cell Layer (has a) Purkinje cell (is a) neuron
    • Ontology: an explicit, formal representation of concepts and relationships among them within a particular domain that expresses human knowledge in a machine-readable form
        – Branch of philosophy: a theory of what is
        – e.g., Gene ontologies
    • Provide universals for navigating across different data sources
        – Semantic “index”
    • Provide the basis for concept-based queries to probe and mine data
        – Perform reasoning
        – Link data through relationships, not just one-to-one mappings
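The "has a" / "is a" chain on this slide is enough to show what a concept-based query adds over one-to-one mappings. The sketch below encodes only the relationships named on the slide; the dictionaries and function names are hypothetical, and real ontologies such as NIFSTD are far larger and expressed in formal languages rather than Python dictionaries.

```python
# Relationships taken from the slide's example chain.
HAS_PART = {
    "Brain": ["Cerebellum"],
    "Cerebellum": ["Purkinje cell layer"],
    "Purkinje cell layer": ["Purkinje cell"],
}
IS_A = {"Purkinje cell": "Neuron"}


def parts_of(structure: str) -> list:
    """Return every direct and indirect part of a structure (transitive has-a)."""
    found = []
    stack = list(HAS_PART.get(structure, []))
    while stack:
        part = stack.pop()
        if part not in found:
            found.append(part)
            stack.extend(HAS_PART.get(part, []))
    return found


def neurons_in(structure: str) -> list:
    """Answer a concept-based query: which parts of this structure are neurons?"""
    return [p for p in parts_of(structure) if IS_A.get(p) == "Neuron"]


print(parts_of("Brain"))
print(neurons_in("Brain"))
```

A record annotated only with "Purkinje cell" is thereby found by a query for neurons in the brain, even though neither word appears in the annotation.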
  • “Search computing”
    What genes are upregulated by drugs of abuse in the adult mouse?
    Morphine; Increased expression; Adult Mouse
    Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system
  • http://
    Stephen Larson
    • Provide a simple interface for defining the concepts required
    • Lightweight semantics
    • Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
    • Community based:
        • Anyone can contribute their terms, concepts, things
        • Anyone can edit
        • Anyone can link
    • Accessible: searched by Google
    • Growing into a significant knowledge base for neuroscience
    Demo D03; 200,000 edits; 150 contributors
  • • NIF can be used to survey the data landscape
    • Analysis of NIF shows multiple databases with similar scope and content
    • Many contain partially overlapping data
    • Data “flows” from one resource to the next
        – Data is reinterpreted, reanalyzed or added to
    • Is duplication good or bad?
  • Databases come in many shapes and sizes
    • Primary data:
        – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
    • Secondary data
        – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
    • Tertiary data
        – Claims and assertions about the meaning of data
            • E.g., gene upregulation/downregulation, brain activation as a function of task
    • Registries:
        – Metadata
        – Pointers to data sets or materials stored elsewhere
    • Data aggregators
        – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
    • Single source
        – Data acquired within a single context, e.g., Allen Brain Atlas
    Researchers are producing a variety of information artifacts using a multitude of technologies
  • NIF Analytics: The Neuroscience Landscape
    NIF is in a unique position to answer questions about the neuroscience landscape
    Where are the data?
    [Figure: data sources by brain region; labels include Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain]
    Vadim Astakhov, Kepler Workflow Engine
  • Adding more semantics
    Diseases of nervous system: Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system
    NIH Reporter; NIF data federated sources
    The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways
  • • Gemma: Gene ID + Gene Symbol
    • DRG: Gene name + Probe ID
    • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
    • Analysis:
        • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
        • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
        • Results for 1 gene were opposite in DRG and Gemma
        • 45 did not have enough information provided in the paper to make a judgment
    Relatively simple standards would make life easier
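The baseline mismatch described on this slide means one source's direction of change has to be flipped before the two can be compared. Here is a minimal sketch under stated assumptions: the gene names and records are invented placeholders, a shared gene key is assumed to exist already (the slide notes the two sources actually use different identifiers), and `compare` is a hypothetical helper. Only the counts 1370 and 617 come from the slide.

```python
def flip(direction: str) -> str:
    """Invert a direction of change to account for a swapped baseline."""
    return {"up": "down", "down": "up"}[direction]


def compare(gemma: dict, drg: dict) -> dict:
    """Compare per-gene directions after re-expressing Gemma relative to saline.

    Both inputs map a gene key to "up" or "down". Gemma is assumed to report
    change relative to chronic morphine, DRG relative to saline.
    """
    result = {}
    for gene, gemma_direction in gemma.items():
        if gene not in drg:
            result[gene] = "not in DRG"
        elif flip(gemma_direction) == drg[gene]:
            result[gene] = "consistent"
        else:
            result[gene] = "opposite"
    return result


# Invented placeholder records, for illustration only.
gemma_records = {"GeneA": "up", "GeneB": "down", "GeneC": "up"}
drg_records = {"GeneA": "down", "GeneB": "down"}
outcome = compare(gemma_records, drg_records)
print(outcome)

# Counts quoted on the slide.
consistent_pct = round(100 * 617 / 1370, 1)
unconfirmed = 1370 - 617
print(consistent_pct, unconfirmed)
```

With the slide's numbers, about 45% of the Gemma statements were consistent with DRG, which leaves 753 of 1370 unconfirmed, in line with the "over half" stated above. If both sources recorded their baseline explicitly, the flip would be a mechanical step rather than something a curator has to discover.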
  • NIF favors a hybrid, tiered, federated system
    • Domain knowledge
        – Ontologies
    • Claims, models and observations
        – Virtuoso RDF triples
        – Model repositories
    • Data
        – Data federation
        – Spatial data
        – Workflows
    • Narrative
        – Full text access
    Neuron; Brain part; Disease; Organism; Gene; Technique; People
    Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS
    NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
  • • 2006-2008: A survey of what was out there
    • 2008-2009: Strategy for resource discovery
        – NIF Registry vs NIF data federation
        – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
        – Effective search across semantically diverse sources
            • NIFSTD ontologies
    • 2009-2011: Strategy for data integration
        – Unified views across common sources
        – Mapping of content to NIF vocabularies
    • 2011-present: Data analytics
        – Uniform external data references
    • 2012-present: SciCrunch: unified biomedical resource services
    NIF provides a strategy and set of tools applicable to all domains grappling with multiple sources of diverse data (i.e., pretty much everything)
  • • Search semantics
    • Ranking
    • Resources supported by NIH Blueprint Institutes are more thoroughly covered
    • Data types, e.g., brain activation foci
  • SciCrunch (not just a data catalog)
    NIF; MONARCH; Community Services; dkCOIN; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; 3D Virtual Cell; National Institute on Aging; One Mind for Research; BIRN; International Neuroinformatics Coordinating Facility; Model Organism Databases; Community Outreach; DELSA
  • SciCrunch
    • 3dVC: Focus on models and simulation
    • Gene Ontology: Focus on bioinformatics tools
    • National Institute on Aging: Aging-related data sets
    • Monarch: Phenotype-Genotype; deep semantic data integration
    • One Mind for Research: Biospecimen repositories
    • NeuroGateway: Computational resources
    • FORCE11: Tools for next-gen publishing and e-scholarship
    SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch
  • “How do I share my data/tool?” “There is no database for my data”
    Community database: beginning; Community database: end; Institutional repositories; Cloud; Tool repositories
    INCF: Global infrastructure; Government; Education; Industry
    NIF is designed to leverage existing investments in resources and infrastructure
  • • No one can be stopped from doing what they need to do
    • Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
        – If the market can support 11 MRI databases, fine
        – Some consolidation, coordination is warranted though
    • Big, broad and messy beats small, narrow and neat
        – Without trying to integrate a lot of data, we will not know what needs to be done
        – A lot can be done with messy data; neatness helps though
        – Progressive refinement; addition of complexity through layers
    • Be flexible and opportunistic
        – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
    • Think globally; act locally:
        – No source, not even NIF, is THE source; we are all a source
  • • Several powerful trends should change the way we think about our data: One → Many
        – Many data
            • Generation of data is getting easier → shared data
            • Data space is getting richer: more -omes every day
            • But... compared to the biological space, still sparse
        – Many eyes
            • Wisdom of crowds
            • More than one way to interpret data
        – Many algorithms
            • Not a single way to analyze data
        – Many analytics
            • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
    Are you exposing or burying your work?
  • Jeff Grethe, UCSD, Co-Investigator, Interim PI; Amarnath Gupta, UCSD, Co-Investigator; Anita Bandrowski, NIF Project Leader
    Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang
    David Van Essen, Washington University; Erin Reid
    Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li
    Giorgio Ascoli, George Mason University; Sridevi Polavarum
    Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Svetlana Sulima; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong
    Tim Clark, Harvard University; Paolo Ciccarese
    Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer
    And my colleagues in Monarch, dkNet, 3DVC, Force 11