Big data from small data:  A survey of the neuroscience landscape through the Neuroscience Information Framework

Presentation on the NIF project to Sandia Labs, with an in depth look into NIF's data federation and strategies for creating on-line knowledge spaces

Published in: Technology, Education

  • 1. Maryann E. Martone, Ph.D., University of California, San Diego
  • 2. Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are. Whole brain data (20 µm microscopic MRI); Mosaic LM images (1 GB+); Conventional LM images; Individual cell morphologies; EM volumes & reconstructions; Solved molecular structures. No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
  • 3. http://
  • 4. • NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use – NIF unites neuroscience information without respect to domain, funding agency, institute or community – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases – Makes them searchable from a single interface – Practical and cost-effective; tries to be sensible – Learned a lot about current data practices. The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes http://
  • 5. http:// (June 10, 2013, dkCOIN Investigator's Retreat) • A portal for finding and using neuroscience resources • A consistent framework for describing resources • Provides simultaneous search of multiple types of information, organized by category • Supported by an expansive ontology for neuroscience • Utilizes advanced technologies to search the “hidden web”. UCSD, Yale, Cal Tech, George Mason, Washington Univ. Literature; Database Federation; Registry
  • 6. We’d like to be able to find: • What is known: – What are the projections of hippocampus? – Is GRM1 expressed in cerebral cortex? – What genes have been found to be upregulated in chronic drug abuse in adults? – What animal models have similar phenotypes to Parkinson’s disease? – What studies used my polyclonal antibody against GABA in humans? • What is not known: – Connections among data – Gaps in knowledge. A framework makes it easier to address these questions
  • 7. With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
  • 8. • NIF curators • Nomination by the community • Semi-automated text mining pipelines • NIF Registry • Requires no special skills • Site map available for local hosting • NIF Data Federation • DISCO interop • Requires some programming skill • Open Source Brain < 2 hr. Two-tiered system: low barrier to entry
  • 9. Current; Planned. DISCO Dashboard Functions • Ingest Script Manager • Public Script Repository • Data & Event Tracker • Versioning System • Curator Tool • Data Transformer Manager. Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
  • 10. NIF was designed to be populated rapidly with progressive refinement
  • 11. Databases come in many shapes and sizes • Primary data: – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL) • Secondary data – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS) • Tertiary data – Claims and assertions about the meaning of data • E.g., gene upregulation/downregulation, brain activation as a function of task • Registries: – Metadata – Pointers to data sets or materials stored elsewhere • Data aggregators – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede • Single source – Data acquired within a single context, e.g., Allen Brain Atlas. Researchers are producing a variety of information artifacts using a multitude of technologies
  • 12. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”. Query expansion: Synonyms and related concepts; Boolean queries; Data sources categorized by “data type” and level of nervous system; Common views across multiple sources; Tutorials for using full resource when getting there from NIF; Link back to record in original source
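
The query expansion on slide 12 can be sketched roughly as follows. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and only the Hippocampus entry comes from the slide.

```python
# Hypothetical synonym table; the Hippocampus entry mirrors slide 12.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Return a Boolean OR query over the term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word phrases so a search engine treats each as one unit.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


if __name__ == "__main__":
    print(expand_query("Hippocampus"))
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a search.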
  • 13. Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site. Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
  • 14. • You (and the machine) have to be able to find it – Accessible through the web – Structured or semi-structured – Annotations • You (and the machine) have to be able to use it – Data type specified and in an actionable form • You (and the machine) have to know what the data mean • Semantics • Context: Experimental metadata • Provenance: where did they come from
  • 15. Knowledge in space and spatial relationships (the “where”); Knowledge in words, terminologies and logical relationships (the “what”)
  • 16. Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex. There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
  • 17. • NIF covers multiple structural scales and domains of relevance to neuroscience • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology. NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
  • 18. Brain has a Cerebellum; Cerebellum has a Purkinje Cell Layer; Purkinje Cell Layer has a Purkinje cell; Purkinje cell is a neuron • Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine readable form • Branch of philosophy: a theory of what is • e.g., Gene ontologies
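
As a rough illustration of what "machine readable" means for the has a / is a chain on slide 18, the statements can be stored as subject-predicate-object triples and traversed. The triple list restates the slide; the predicate spellings `has_a` and `is_a` and the function `parts_of` are assumptions made for this sketch.

```python
# The four statements from slide 18, written as (subject, predicate, object).
TRIPLES = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def parts_of(whole, triples=TRIPLES):
    """Collect everything reachable from `whole` by following has_a links."""
    found, frontier = [], [whole]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == "has_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


if __name__ == "__main__":
    print(parts_of("Brain"))
```

Because the relationships are explicit, a program can infer that a Purkinje cell is part of the brain even though no single statement says so.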
  • 19. • Express neuroscience concepts in a way that is machine readable – Synonyms, lexical variants – Definitions • Provide means of disambiguation of strings – Nucleus part of cell; nucleus part of brain; nucleus part of atom • Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter • Properties – Support reasoning • Provide universals for navigating across different data sources – Semantic “index” – Link data through relationships not just one-to-one mappings • Provide the basis for concept-based queries to probe and mine data • Establish a semantic framework for landscape analysis. Mathematics, Computer code or Esperanto
  • 20. birnlex_1732; Brodmann.1. Explicit mapping of database content helps disambiguate non-unique and custom terminology
  • 21. Aligns sources to the NIF semantic framework
  • 22. • Search Google: GABAergic neuron • Search NIF: GABAergic neuron – NIF automatically searches for types of GABAergic neurons. Types of GABAergic neurons. Search by meaning, not by string
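
The difference between searching by string and searching by meaning can be sketched as below. The subclass table and the records are illustrative stand-ins, not NIF content, and `descendants`, `search_by_string` and `search_by_meaning` are hypothetical names.

```python
# Illustrative child -> parent class links (not taken from NIFSTD).
SUBCLASS_OF = {
    "Purkinje cell": "GABAergic neuron",
    "Basket cell": "GABAergic neuron",
    "GABAergic neuron": "Neuron",
}

# Illustrative records from some federated source.
RECORDS = [
    {"id": 1, "cell": "Purkinje cell"},
    {"id": 2, "cell": "GABAergic neuron"},
    {"id": 3, "cell": "Pyramidal cell"},
]


def descendants(term, subclass_of=SUBCLASS_OF):
    """Return the term plus every class that is directly or indirectly a subclass of it."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def search_by_string(term, records=RECORDS):
    """Match only records whose label equals the query string."""
    return [r["id"] for r in records if r["cell"] == term]


def search_by_meaning(term, records=RECORDS):
    """Match records labelled with the term or any of its subclasses."""
    wanted = descendants(term)
    return [r["id"] for r in records if r["cell"] in wanted]


if __name__ == "__main__":
    print(search_by_string("GABAergic neuron"))
    print(search_by_meaning("GABAergic neuron"))
```

The string search misses the Purkinje cell record because its label never mentions GABA; the meaning-based search finds it through the class hierarchy.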
  • 23. Equivalence classes; restrictions. Arbitrary but defensible • Neurons classified by • Circuit role: principal neuron vs interneuron • Molecular constituent: Parvalbumin-neurons, calbindin-neurons • Brain region: Cerebellar neuron • Morphology: Spiny neuron • Molecule Roles: Drug of abuse, anterograde tracer, retrograde tracer • Brain parts: Circumventricular organ • Organisms: Non-human primate, non-human vertebrate • Qualities: Expression level • Techniques: Neuroimaging
  • 24. What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!) Morphine; Increased expression; Adult Mouse
  • 25. • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions • Brain Architecture Management System (rodent) • Temporal (rodent) • Connectome Wiki (human) • Brain Maps (various) • CoCoMac (primate cortex) • UCLA Multimodal database (Human fMRI) • Avian Brain Connectivity Database (Bird) • Total: 1800 unique brain terms (excluding Avian) • Number of exact terms used in > 1 database: 42 • Number of synonym matches: 99 • Number of 1st order partonomy matches: 385
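
The three match tiers counted on slide 25 (exact term, synonym, first-order partonomy) could be distinguished along these lines. This is a sketch under stated assumptions: the lookup tables reuse examples from slides 12 and 18 rather than the actual connectivity databases, and `match_type` is a hypothetical helper.

```python
# Toy lookup tables; entries reuse examples from slides 12 and 18.
SYNONYMS = {"hippocampus": {"cornu ammonis", "ammon's horn"}}
PART_OF = {"purkinje cell layer": "cerebellum"}  # child -> direct parent


def match_type(term_a, term_b):
    """Classify how two brain-region terms from different databases relate."""
    a, b = term_a.lower(), term_b.lower()
    if a == b:
        return "exact"
    if b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set()):
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if PART_OF.get(a) == b or PART_OF.get(b) == a:
        return "partonomy"
    return None


if __name__ == "__main__":
    print(match_type("Hippocampus", "Ammon's horn"))
    print(match_type("Cerebellum", "Purkinje Cell Layer"))
```

Each successive tier relies on more of the ontology, which is consistent with the slide's counts rising from 42 exact matches to 385 partonomy matches.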
  • 26. http:// • Semantic MediaWiki • Provide a simple interface for defining the concepts required • Light weight semantics • Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework • Community based: • Anyone can contribute their terms, concepts, things • Anyone can edit • Anyone can link • Accessible: searched by Google • Growing into a significant knowledge base for neuroscience • International Neuroinformatics Coordinating Facility. Demo D03. Larson et al, Frontiers in Neuroinformatics, in press
  • 27. • Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data • Implemented forms for certain types of entities • Neuroscience knowledge in the web. Pages are linked through properties; Knowledge-base built through cross-modular relations and links to data; red links
  • 28. • > 1000 DICOM Terms – Karl Helmer – Data Sharing Task Force • Tasks and Cognitive Concepts from Cognitive Atlas – Russ Poldrack • >280 Neurons – Gordon Shepherd and 30 worldwide experts • ~500 fly neurons from Fly Anatomy Ontology – David Osumi-Sutherland • >1200 Brain parcellations. ~20,000 concepts: Spreadsheet downloads, through NIF Web Services, SPARQL endpoint; 200,000 edits; 150 contributors
  • 29. Because their pages have static URLs, wikis are searchable by Google
  • 30. Neurolex: > 1 million triples; Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph
  • 31. Are projections from the VTA excitatory or inhibitory? 1. Look brain region up in NeuroLex 2. Look up cells contained in the brain region 3. Find those cells that are known to project out of that brain region 4. Look up the neurotransmitters for those cells 5. Determine whether those neurotransmitters are known to be excitatory or inhibitory 6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made 7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong. Stephen Larson; CHEBI:18243
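
Steps 1 to 6 on slide 31 can be sketched as a chain of lookups. The knowledge base below is a toy stand-in for NeuroLex and deliberately uses the cerebellar cortex rather than the VTA, so that it does not presume the answer to the slide's question; the dictionary layout and the function `projection_effects` are assumptions made for illustration only.

```python
# Toy knowledge base standing in for NeuroLex wiki pages (illustrative only).
KB = {
    "cells_in_region": {
        "Cerebellar cortex": ["Purkinje cell", "Cerebellar granule cell"],
    },
    "projects_out": {"Purkinje cell": True, "Cerebellar granule cell": False},
    "neurotransmitter": {
        "Purkinje cell": "GABA",
        "Cerebellar granule cell": "Glutamate",
    },
    "effect": {"GABA": "inhibitory", "Glutamate": "excitatory"},
}


def projection_effects(region, kb=KB):
    """Follow steps 1-6: region -> cells -> projecting cells -> transmitter -> effect."""
    results = []
    for cell in kb["cells_in_region"].get(region, []):       # steps 1-2
        if not kb["projects_out"].get(cell, False):           # step 3
            continue
        transmitter = kb["neurotransmitter"].get(cell)        # step 4
        effect = kb["effect"].get(transmitter, "unknown")     # step 5
        chain = [                                             # step 6: keep the reasoning
            f"{region} contains {cell}",
            f"{cell} projects out of {region}",
            f"{cell} releases {transmitter}",
            f"{transmitter} is {effect}",
        ]
        results.append({"cell": cell, "effect": effect, "chain": chain})
    return results


if __name__ == "__main__":
    for result in projection_effects("Cerebellar cortex"):
        print(result["effect"], "via", result["cell"])
        for statement in result["chain"]:
            print("  -", statement)
```

Keeping the chain of statements alongside the answer is what makes step 7 possible: each statement can be traced back to its source and corrected.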
  • 32. • INCF Project – Neuron Registry – > 30 experts worldwide – Fill out neuron pages in Neurolex Wiki – Led by Dr. Gordon Shepherd. [Chart: number of total redlinks, easy fixes and hard fixes for soma location, dendrite location and axon location.] Social networks and community sites let us learn things from the collective behavior of contributors
  • 33. Semantic Wiki • INCF Community encyclopedia • Define all vocabulary, terms, protocols, brain structures, diseases, etc. • Living review articles • Links to data, models and literature • Semantic organization, search, analysis and integration • Searchable via the web • Global directory of all shared vocabularies, CDEs, etc. Slide courtesy of Sean Hill, International Neuroinformatics Coordinating Facility
  • 34. Martin Telefont, HBP: Lab Space connecting to Knowledge Space
  • 35. • NIF can be used to survey the data landscape • Analysis of NIF shows multiple databases with similar scope and content • Many contain partially overlapping data • Data “flows” from one resource to the next – Data is reinterpreted, reanalyzed or added to • Is duplication good or bad? NIF is trying to make it easier to work with diverse data
  • 36. NIF is in a unique position to answer questions about the neuroscience landscape: Kepler Workflow engine + NIF semantics. Where are the data? Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain; Brain region; Data source
  • 37. ∞; What is easily machine processable and accessible; What is potentially knowable; What is known: Literature, images, human knowledge; Unstructured; Natural language processing, entity recognition, image processing and analysis; paywalls; communication; Abstracts vs full text vs tables etc.
  • 38. Closed world vs open world. We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased. But... NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
  • 39. Neocortex; Olfactory bulb; Neostriatum; Cochlear nucleus. All neurons with cell bodies in the same brain region are grouped together. Properties in Neurolex
  • 40. Exposing knowledge gaps and biases. Where are the data? Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain; Brain region; Data source; Funding
  • 41. • Gemma: Gene ID + Gene Symbol • DRG: Gene name + Probe ID • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases • Analysis: • 1370 statements from Gemma regarding gene expression as a function of chronic morphine • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis • Results for 1 gene were opposite in DRG and Gemma • 45 did not have enough information provided in the paper to make a judgment. Relatively simple standards would make life easier
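
The comparison on slide 41 depends on bringing both sources to the same baseline before judging agreement. The sketch below assumes each source reduces to a gene-to-direction mapping; the gene names are placeholders, and `flip` and `compare` are hypothetical helpers rather than the analysis actually run. The final lines only restate the slide's arithmetic.

```python
def flip(direction):
    """Invert a direction of change to account for the opposite baseline."""
    return {"up": "down", "down": "up"}[direction]


def compare(gemma, drg):
    """Compare gene -> direction maps after flipping Gemma to the DRG baseline."""
    outcome = {}
    for gene, gemma_direction in gemma.items():
        if gene not in drg:
            outcome[gene] = "no_information"
        elif flip(gemma_direction) == drg[gene]:
            outcome[gene] = "consistent"
        else:
            outcome[gene] = "opposite"
    return outcome


# Counts reported on the slide.
TOTAL_STATEMENTS = 1370
CONSISTENT = 617
consistent_fraction = CONSISTENT / TOTAL_STATEMENTS
not_confirmed = TOTAL_STATEMENTS - CONSISTENT

if __name__ == "__main__":
    # Placeholder gene names, not real records from either database.
    gemma = {"GeneA": "up", "GeneB": "down", "GeneC": "up"}
    drg = {"GeneA": "down", "GeneB": "down"}
    print(compare(gemma, drg))
    print(f"{consistent_fraction:.1%} consistent, {not_confirmed} not confirmed")
```

With the slide's numbers, 617 of 1370 statements is about 45%, leaving 753 not confirmed, which matches the "over half" remark.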
  • 42. NIF favors a hybrid, tiered, federated system • Domain knowledge – Ontologies • Claims, models and observations – Virtuoso RDF triples – Model repositories • Data – Data federation – Spatial data – Workflows • Narrative – Full text access. Neuron; Brain part; Disease; Organism; Gene; Technique; People. Caudate projects to SNpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS. NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
  • 43. Scholar; Library; Scholar; Publisher. Future of research communications and e-scholarship
  • 44. Scholar; Consumer; Libraries; Data Repositories; Code Repositories; Community databases/platforms; OA; Curators; Social Networks; Social Networks; Social Networks; Peer Reviewers; Narrative; Workflows; Data; Models; Multimedia; Nanopublications; Code
  • 45. • Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories: – Organism – Anatomical structure – Cell – Molecule – Function – Dysfunction – Technique • 30-50% of NIF’s queries autocomplete • When NIF combines multiple sources, a set of common fields emerges – Basic information models/semantic models exist for certain types of entities. Semantic frameworks create spaces in which to compare the current state of data and knowledge
  • 46. • Several powerful trends should change the way we think about our data: One → Many – Many data • Generation of data is getting easier → shared data • Data space is getting richer: more –omes every day • But... compared to the biological space, still sparse – Many resources: everyone wants to be “the” one but e pluribus unum – Many eyes • Wisdom of crowds • More than one way to interpret data – Many algorithms • Not a single way to analyze data – Many analytics • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting. New works need to be created with an eye towards the web and interoperability
  • 47. Jeff Grethe, UCSD, Co-Investigator, Interim PI; Amarnath Gupta, UCSD, Co-Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Svetlana Sulima; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer. And my colleagues in Monarch, dkNet, 3DVC, Force 11
  • 48. Data Space; Laboratory Space; Knowledge Space; BAMS; Lexicon; Encyclopedia
  • 49. 47/50 major preclinical published cancer studies could not be replicated • “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.” • Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data. Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
  • 50. • Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like – If the market can support 11 MRI databases, fine – Some consolidation, coordination is usually warranted • Big, broad and messy beats small, narrow and neat – Without trying to integrate a lot of data, we will not know what needs to be done – Progressive refinement; addition of complexity through layers • Be flexible and opportunistic – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing • Think globally; act locally: – No source, not even NIF, is THE source; we are all a source – Think about interoperation from the inception
  • 51. Regional part of nervous system; Parcellation scheme parcel; Parcellation scheme parcel; Single species or strain; Parcellation scheme; Precise definition; Technique; Functional part of nervous system; Partially overlaps; Taxon rank; General hierarchy. INCF Task Force: Alan Rutenberg, Seth Ruffins
  • 52. • 1200 parts of nervous system characterized (mostly) according to CUMBO terms • 1200 “parcels” from individual atlases/papers • 700 neurons • 280 via Neuron Registry • Available via NIF vocabulary services (REST) • Hosted in a Virtuoso triple store via SPARQL