Bairoch ISB closing-talk: CALIPHO

Plenary talk from Dr Amos Bairoch presented at the 5th International Biocuration Conference, hosted by PIR in Washington, DC, April 2-4, 2012.

  1. CALIPHO: two missions for one goal: increasing our knowledge on human proteins. Amos Bairoch, April 4, 2012
  2. Computer Analysis and Laboratory Investigation of Proteins of Human Origin
     Its two missions:
     • Carry out laboratory experiments on selected sets of uncharacterized human proteins to discover their function;
     • Develop neXtProt, an ambitious new knowledge resource centered around human proteins.
  3. ‘The’ human genome
     • Sequencing a human genome is no longer a technological challenge;
     • Making sense of what it tells us is still much more problematic than anyone ever expected.
  4. Almost 12 years ago, at the 4th Siena meeting, we proposed to annotate in Swiss-Prot all the human proteins
  5. Some stats on human proteins from UniProtKB/Swiss-Prot
     • 20’244 reviewed entries (~protein-coding genes);
     • 16’000 additional isoforms in about 8’100 entries (40% but will probably rise to >60%): 50’000 different protein sequences;
     • 65’000 variants; 22’500 linked to diseases; the rest are SNPs that are SAPs (2 per protein). This is the tip of the iceberg;
     • 80’000 PTMs (50% of which are experimental). This is the tip of the tip of the iceberg!
  6. Some issues about protein-coding genes
     • We completely agree with what was shown earlier in this meeting by the HAVANA group: that there are slightly fewer than 20K protein-coding genes;
     • Many weirdos in the genome: bicistronic mRNAs, genes that produce through splicing proteins with no sequence relationship, multiple genes for the same protein, etc.;
     • Variation in terms not only of SNPs but of copy number. And some segregating pseudogenes (olfactory receptors);
     • How many have been proven at protein level?
       – Using the protein evidence “metric” used at UniProt and neXtProt, we are now at about 70%;
       – But if we were hunting everywhere in good-quality MS data, it would rise to about 85%. The big issue in proteomics is how to hunt for the last 15%.
  7. The PTM world is still largely uncharted
     1-(3R)-3-hydroxyasparagine, (3R)-3-hydroxyaspartate, (3S)-3-hydroxyasparagine, 1-histidyl-3-tyrosine, thioglycine, 2,4,5-topaquinone, 2,3-didehydroalanine, 3-(S-cysteinyl)-tyrosine, 3-hydroxyproline, 3-oxoalanine, 4-amino-3-isothiazolidinone serine, 4-carboxyglutamate, 4-hydroxyproline, 5-glutamyl, 5-glutamyl glycerylphosphorylethanolamine, 5-hydroxylysine, 5-imidazolinone, ADP-ribosylasparagine, ADP-ribosylcysteine, ADP-ribosylserine, Allysine, Arginine amide, Asparagine amide, Aspartate 1-(chondroitin 4-sulfate)-ester, Asymmetric dimethylarginine, Beta-decarboxylated aspartate, Cholesterol glycine ester, Citrulline, Cysteine methyl ester, Cysteine sulfenic acid, Cysteinyl-selenocysteine, Deamidated asparagine, Deamidated glutamine, Dimethylated arginine, Diphthamide, Disulfide bond, GPI-anchor amidated alanine, GPI-anchor amidated asparagine, GPI-anchor amidated aspartate, GPI-anchor amidated cysteine, GPI-anchor amidated glycine, GPI-anchor amidated serine, Glutamic acid 1-amide, Glutamine amide, Glycine amide, Glycyl adenylate, Glycyl lysine isopeptide, Hydroxyproline, Hydroxyproline, Hypusine, Isoglutamyl cysteine thioester, Isoglutamyl lysine isopeptide, Isoleucine amide, Leucine amide, Leucine methyl ester, Lysine amide, Lysine tyrosylquinone, Methionine amide, N,N,N-trimethylalanine, N-acetylalanine, N-acetylaspartate, N-acetylcysteine, N-acetylglutamate, N-acetylglycine, N-acetylmethionine, N-acetylproline, N-acetylserine, N-acetylthreonine, N-acetylvaline, N-myristoyl glycine, N-palmitoyl cysteine, N-palmitoyl glycine, N-pyruvate 2-iminyl-valine, N4,N4-dimethylasparagine, N6,N6,N6-trimethyllysine, N6,N6-dimethyllysine, N6-(pyridoxal phosphate)lysine, N6-(retinylidene)lysine, N6-1-carboxyethyl lysine, N6-acetyllysine, N6-biotinyllysine, N6-carboxylysine, N6-lipoyllysine, N6-methylated lysine, N6-methyllysine, N6-myristoyl lysine, Nitrated tyrosine, O-(pantetheine 4-phosphoryl)serine, O-AMP-threonine, O-AMP-tyrosine, O-acetylserine, O-acetylthreonine, O-decanoylserine, O-palmitoyl serine, Omega-N-methylarginine, Omega-N-methylated arginine, Omega-hydroxyceramide glutamate ester, Phenylalanine amide, Phosphatidylethanolamine amidated glycine, Phosphohistidine, Phosphoserine, Phosphothreonine, Phosphotyrosine, PolyADP-ribosyl glutamic acid, Proline amide, Pyrrolidone carboxylic acid, Pyruvic acid, S-(dipyrrolylmethanemethyl)cysteine, S-8alpha-FAD cysteine, S-Lysyl-methionine sulfilimine, S-cysteinyl cysteine, S-farnesyl cysteine, S-geranylgeranyl cysteine, S-glutathionyl cysteine, S-methylcysteine, S-nitrosocysteine, S-palmitoyl cysteine, S-stearoyl cysteine, Sulfoserine, Sulfotyrosine, Symmetric dimethylarginine, Tele-8alpha-FAD histidine, Tele-
     And all this does not include all the different glycosylation forms and the processing events
  8. From genome to proteome
     • ~20’000 protein-coding genes
     • Alternative splicing of mRNA: 2-5 fold increase
     • ~50 to 100’000 transcripts (mRNAs)
     • Post-translational modifications of proteins (PTMs): 50-100 fold increase
     • ~5’000’000 different proteins
     Protein complexity
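
The fold increases on the "From genome to proteome" slide can be multiplied through as a rough sanity check. This is only an illustrative sketch: the names `GENES`, `SPLICING_FOLD`, `PTM_FOLD` and `expand` are made up here, and the inputs are the rounded figures from the slide.

```python
# Rough check of the slide's orders of magnitude.
# All inputs are the rounded figures quoted on the slide.
GENES = 20_000            # ~20'000 protein-coding genes
SPLICING_FOLD = (2, 5)    # alternative splicing: 2-5 fold increase
PTM_FOLD = (50, 100)      # PTMs: 50-100 fold increase


def expand(count_range, fold_range):
    """Multiply a (low, high) count range by a (low, high) fold range."""
    return (count_range[0] * fold_range[0], count_range[1] * fold_range[1])


transcripts = expand((GENES, GENES), SPLICING_FOLD)
proteins = expand(transcripts, PTM_FOLD)

print(transcripts)  # (40000, 100000)
print(proteins)     # (2000000, 10000000)
```

The resulting ranges bracket the slide's "~50 to 100’000 transcripts" and "~5’000’000 different proteins", which are order-of-magnitude estimates rather than exact products.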
  9. The complexity of life and of its molecular actors is fractal
  10. Many human proteins for which we lack functional knowledge
      1. Similar to characterized proteins in distant organisms (bacteria, plants, yeast), but no validation in mammals;
      2. Presence of domains that help predict a ‘general’ function but not a precise one (examples: hydrolase fold, GPCR);
      3. Presence of domains or sequence features that help define some properties (examples: PDZ -> PPI, many TMs -> integral membrane protein);
      4. “Orphan”. With no similarity to any characterized proteins but that can be conserved across a more or less wide taxonomic space.
      About 5’000 human proteins are in one of the above four categories
  11. Overview of the CALIPHO wet lab strategy
      • In silico selection: sequence analysis, phylogeny, data mining
      • Tissue/cell line expression (RT-PCR)
      • Cloning of cDNA in the Gateway system
      • Yeast two hybrid
      • Subcellular location in HeLa cells (confocal imaging)
      • Recombinant protein production in E. coli
      • Validation of protein-protein interactions (GST pull down, co-IP)
      • 3D structure by NMR
      • Data mining, modelling
      • Hypothesis generation
      • Functional assays on cell lines (RNAi)
      • In vivo validation (animal models, e.g. zebrafish)
      Legend: CALIPHO@UniGe, collaborators, CALIPHO@SIB
  12. After 2.5 years…
      • A protein involved in ciliogenesis;
      • An enzyme involved in a salvage pathway not yet characterized in vertebrates;
      • A myristoylated and palmitoylated protein that could be involved in membrane blebbing;
      • A mitochondrial protein that may play a role in a Mt import mechanism.
  13. Personal view
      • Cons:
        – It takes much longer than you expect or want! And magic and luck seem to be the most important factors in successful experiments!
        – The low quality/cost ratio of many lab reagents (defective antibodies for example!);
        – You can’t freely share preliminary results with everyone because you may (will!) be scooped.
      • Pros:
        – Fun to see bioinformatics predictions confirmed in the lab;
        – Nice collaborations;
        – Great lab atmosphere.
  14. • What: a comprehensive resource that complements SIB/EBI Swiss-Prot human protein annotation efforts. We expect neXtProt to become a central resource for human protein-centric information;
      • How:
        – by mining, in the most appropriate way and with stringent quality criteria, many high-throughput data resources. We plan to add additional protein/protein and protein/small molecule interactions, proteomics data, pathways/networks information, variation data (such as SNP frequencies), siRNA screen data, phylogenetic profiling, etc.;
        – by integrating experimental results from an extensive network of collaborating laboratories.
  15. UniProtKB/Swiss-Prot human entries links
      In Swiss-Prot users always need to navigate toward many external resources so as to consolidate data into knowledge. In neXtProt the most pertinent data will be integrated so as to enable complex queries.
      • Sequence databases: EMBL, IPI, PIR, RefSeq, UniGene
      • Enzyme and pathway databases: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
      • Proteomics: HPA, PeptideAtlas, PRIDE
      • Family and domain databases: Gene3D, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
      • 2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, OGP, PMMA-2DPAGE, REPRODUCTION-2DPAGE, SWISS-2DPAGE, World-2DPAGE
      • Miscellaneous: ArrayExpress, Bgee, BindingDB, CleanEx, dbSNP, DIP, DrugBank, GO, HOGENOM, HOVERGEN, IntAct, LinkHub, NextBio
      • Organism-specific databases: GeneCards, H-InvDB, HGNC, MIM, Orphanet, PharmGKB
      • Genome annotation databases: Ensembl, GeneID, KEGG, NMPDR
      • 3D structure databases: DisProt, HSSP, PDB, PDBsum, SMR
      • Protein family/group databases: GermOnline, MEROPS, PeroxiBase, REBASE, TCDB
      • PTM databases: GlycoSuiteDB, PhosphoSite
  16. What is not neXtProt?
      • No, neXtProt is not a replacement for UniProtKB/Swiss-Prot;
      • No, neXtProt is not universal in coverage, it is intended to provide knowledge pertinent to human proteins;
      • No, neXtProt is not a sequence resource: it uses the sequence data curated in Swiss-Prot.
  17. When and what?
      • In early 2011 we released a first public version that contained in terms of data:
        – All of Swiss-Prot human data: sequences and annotations;
        – Human Protein Atlas (HPA) organ and tissue expression information from IHC (antibodies);
        – Metadata on mRNA expression from microarrays and ESTs from Bgee (analyzed from ArrayExpress and UniGene);
        – Additional SNPs from dbSNP and Ensembl;
        – Chromosomal location and exons mapping from Ensembl;
        – Affymetrix and Illumina chip sets identifiers.
      • In terms of interface, it offers:
        – An intuitive query interface;
        – Many specialized views (function, medical, expression, etc.);
        – The possibility to tag and label proteins.
  18. Bronze, silver and gold
      • We have a three-tiered approach as to data quality:
        – Bronze: noisy or low quality data that is not imported in the platform;
        – Silver: good data, but…
        – Gold: data that we believe to be of a swiss-(prot)-level quality.
      • By default searches in neXtProt are carried out on gold data;
      • Quality classification is a dynamic process.
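
The three-tier rule on the "Bronze, silver and gold" slide can be sketched in a few lines. This is a hypothetical illustration, not neXtProt code: the `Quality` enum, the record dictionaries and the `importable` and `search` functions are all assumptions made for the example.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical encoding of the three tiers; higher means more trusted."""
    BRONZE = 1
    SILVER = 2
    GOLD = 3


def importable(records):
    """Bronze data is not imported into the platform."""
    return [r for r in records if r["quality"] > Quality.BRONZE]


def search(records, term, min_quality=Quality.GOLD):
    """Search imported records; by default only gold data is considered."""
    return [
        r for r in importable(records)
        if r["quality"] >= min_quality and term.lower() in r["text"].lower()
    ]


# Made-up annotations, one per tier.
records = [
    {"text": "kinase activity", "quality": Quality.GOLD},
    {"text": "kinase binding", "quality": Quality.SILVER},
    {"text": "kinase-like signal", "quality": Quality.BRONZE},
]

gold_hits = search(records, "kinase")
wider_hits = search(records, "kinase", min_quality=Quality.SILVER)
```

Storing the tier as a field on each record, rather than filtering once at load time, fits the slide's remark that quality classification is a dynamic process: a record can be promoted or demoted without being reloaded.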
  19. Query features
  20. A variety of views for a single protein
  21. An innovative sequence viewer
  22. Information at the genomic level
  23. Expression data at mRNA and protein levels
  24. A new proteomics page
  25. PTMs
      We are loading high-quality sets of PTMs, starting with N-glycosylation and phosphorylation
  26. Peptide identifications
      • HUPO brain and plasma project peptides from PeptideAtlas;
      • Sets linked with PTMs;
      • Carapito et al mitochondrial N-terminome project.
      And to be loaded soon:
      • Other HUPO data sets;
      • Data from various labs (Vienna, Geneva, Roche (Basel), Montpellier, etc.).
  27. New subcellular localization data
      • From two projects: DKFZ GFP-cDNA@EMBL and WIS Kahn Dynamic Proteomics db
  28. Data export
      • Export of data both in XML and in PEFF formats;
      • neXtProt is the first resource to offer support to the PSI PEFF format;
      • This enriched FASTA format allows search engines and other tools to easily and consistently access data essential to the success of HPP, namely sequence variations and PTMs.
  29. Download by FTP
      • At
      • To obtain downloads in XML or PEFF;
      • These files are also available per chromosome as well as ‘report’ files
  30. What’s next in terms of tools
      • A tool for the analysis of lists of proteins so as to explore their enrichment in various types of annotations, including Gene Ontology (GO) terms.
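
The slide does not say which statistic the planned enrichment tool will use. One common choice for this kind of list analysis is a one-sided hypergeometric test, sketched below with the standard library only. The function name `hypergeom_enrichment_p` and the toy numbers are assumptions for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of proteins in the background set
    annotated:  background proteins carrying the annotation (e.g. a GO term)
    sample:     size of the user's protein list
    hits:       proteins in the list that carry the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example: 10 background proteins, 5 annotated, a list of 2, both annotated.
p = hypergeom_enrichment_p(population=10, annotated=5, sample=2, hits=2)
print(round(p, 4))  # 0.2222
```

In real use a correction for multiple testing across many annotation terms would normally follow; that step is left out here.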
  31. Programmatic access
      • We will build an API to allow third party software tools to make use of the data in neXtProt;
      • Together with BIONEXT, we have obtained a grant to develop this API and integrate a version of their 3D structure visualisation tool in neXtProt.
  32. A note about variants
      • There are now over 420’000 variants loaded in neXtProt;
      • 65’000 come from Swiss-Prot; the others have been loaded from dbSNP through Ensembl;
      • We will also load the Cosmic variants as well as variants from other sources.
  33. We also want to do many other things as quickly as possible but…
  34. The road map: principles
      • Our vision is to gradually build up neXtProt, not only by adding new data resources but:
        – By integrating state of the art data mining tools;
        – By integrating some forms of “social networking” functionalities allowing researchers to share ideas and data;
        – By enabling the modeling of hypotheses inside the framework of the platform.
      • To work closely with collaborators and users to define how the data and tools that we will incorporate into neXtProt will be useful for their research.
  35. A new resource for cell lines
      • There are three ontologies catering for cell lines (MCCL, CLO, Brenda);
      • A large number of on-line catalogs: ATCC, CBA, CCRID, Coriell, DSMZ, ECACC, ICLC, IFO, IZSLER, JCRB, RCB, Riken;
      • There are information resources: CABRI, CCLE, COPE, HyperCLDB, Lonza;
      • Databases storing cell lines as “samples”: Cosmic;
      • Topical reviews on ‘categories’ of cell lines;
      • Various lists of contaminated cell lines…
      But there was so far no single resource pooling together all this information in an attempt to create a cell line thesaurus.
  36. • Not an ontology, but a thesaurus;
      • Links to all the ontologies, catalogs, resources, publications, web sites, etc. (over 20’000 Xref);
      • Current version: 8766 cell lines. The next version (May) will have over 10’000 lines, 5’000 synonyms;
      • Scope: vertebrates (80% human, 15% mouse and rat; the remainder is associated with about 100 species);
      • Currently available in a Swiss-Prot like text-based format at: ftp://
      • But it will soon also be available in OBO format as it has a number of relationships (derives_from, etc.);
      • Currently: no links to tissues and diseases, but this will be added later.
  37. ID   22Rv1
      AC   CVCL_1045
      SY   22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
      DR   CLO; CLO_0001199
      DR   CLO; CLO_0001200
      DR   Brenda; BTO:0002999
      DR   CLDB; cl7072
      DR   ATCC; CRL-2505
      DR   CCLE; 22RV1_PROSTATE
      DR   CCRID; 3131C0001000700100
      DR   Cosmic; 924100
      DR   DSMZ; ACC-438
      DR   ECACC; 05092802
      DR   PubMed; 14518029
      WW
      WW
      OX   NCBI_TaxID=9606; ! Homo sapiens
      HI   CVCL_3967 ! CWR22
      //
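
A Swiss-Prot like text record such as the 22Rv1 entry on slide 37 is straightforward to read programmatically. The sketch below is a minimal parser written for illustration. It assumes each line starts with a two-letter line code followed by spaces and a value, and that `//` ends an entry; the exact column layout of the real file is not given in the slides, and `parse_entries` is a made-up name. The sample is a shortened copy of the slide's record.

```python
SAMPLE = """\
ID   22Rv1
AC   CVCL_1045
SY   22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
DR   CLO; CLO_0001199
DR   ATCC; CRL-2505
OX   NCBI_TaxID=9606; ! Homo sapiens
HI   CVCL_3967 ! CWR22
//
"""


def parse_entries(text):
    """Yield one dict per entry; each line code maps to a list of values."""
    entry = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):       # end-of-entry marker
            if entry:
                yield entry
            entry = {}
            continue
        code, _, value = line.partition(" ")
        entry.setdefault(code, []).append(value.strip())
    if entry:                            # tolerate a missing final terminator
        yield entry


entries = list(parse_entries(SAMPLE))
first = entries[0]
synonyms = [s.strip() for s in first["SY"][0].split(";")]
xrefs = [tuple(part.strip() for part in dr.split(";", 1)) for dr in first["DR"]]

print(first["AC"][0])  # CVCL_1045
print(xrefs)           # [('CLO', 'CLO_0001199'), ('ATCC', 'CRL-2505')]
```

Keeping every line code as a list handles repeated codes such as `DR` without special cases, at the cost of indexing `[0]` for single-valued codes such as `ID` and `AC`.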
  38. The ISB
      • A young society but already very active:
      • Pros:
        – Over 310 active members from 15 countries;
        – The international meeting (now yearly);
        – Good links to journals such as Database and NAR;
        – Common projects such as BioDBCore
      • Cons:
        – Not enough grass-roots involvement of the members;
        – Not yet enough awareness of the existence of the society by would-be members in many countries (Eastern Europe, South America, etc.) but also closer to ‘home’ (in the US).
      Be more proactive!
  39. Biocuration is an expanding field
      • Good news:
        – Increasing number of biocurators in academia and industry;
        – More and more knowledge resources incorporate some amount of manual biocuration.
      • Bad news:
        – The usual problem of long-term funding and sustainability of key resources;
        – A lot of re-inventing the wheel as annotation SOPs are generally not easily available.
  40. The data flood
      • Yes it exists but…
      • A big proportion of the data that accumulates today is not going to be useful in a few years;
      • For example: if we have a clean full-length genome sequence of “all” representative species on earth, this is only 10 petabytes of information (10 million species with 1 billion bp each);
      • The genome of a human being stored as a variant file is only 60 Mb (compressed). So storing the variation information for 10 billion individuals takes slightly less than 1 exabyte, not a big challenge in terms of technology and price in 2020;
      • In the meanwhile we are still encapsulating our most important knowledge using a 16th century technology: free
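
The storage estimates on the data flood slide can be reproduced with a few multiplications. The sketch below rests on two assumptions not stated on the slide: one byte per base pair, and "60 Mb" read as 60 megabytes with decimal (SI) prefixes.

```python
PETABYTE = 10**15
EXABYTE = 10**18

# All genomes: 10 million species x 1 billion bp, assuming 1 byte per base pair.
species = 10_000_000
bp_per_genome = 1_000_000_000
genomes_bytes = species * bp_per_genome
print(genomes_bytes / PETABYTE)   # 10.0

# Human variation: 60 MB compressed variant file per person, 10 billion people.
variant_file_bytes = 60 * 10**6
people = 10_000_000_000
variation_bytes = variant_file_bytes * people
print(variation_bytes / EXABYTE)  # 0.6
```

With a denser encoding of 2 bits per base the first figure would drop to 2.5 petabytes, so the slide's 10 petabytes is on the generous side.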
  41. CALIPHO@UniGe_and_SIB
      • neXtProt content:
        – Coordinator: Pascale Gaudet
        – Biocurators: Guislaine Argoud-Puy, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Nevila Nouspikel
        – QA: Monique Zahn
      • neXtProt software developers:
        – Olivier Evalet, Alain Gateau, Anne Gleizes, Mario Pereira, Catherine Zwahlen (and for two years: Alexandre Masselot)
      • Laboratory research:
        – Franck Bontems, Marjorie Desmurs, Camille Mary, Rachel Porcelli, Irene Rossito, Lisa Salleron, Fabiana Tirone
      • Directed by:
        – Amos Bairoch and Lydie Lane
      And we have a position open for a Java developer (will soon be announced on the ISB web)