CALIPHO: two missions for one goal: increasing our knowledge of human proteins
Amos Bairoch, April 4, 2012
Computer Analysis and Laboratory Investigation of Proteins of Human Origin
Its two missions:
• Carry out laboratory experiments on selected sets of uncharacterized human proteins to discover their function;
• Develop neXtProt, an ambitious new knowledge resource centered around human proteins.
‘The’ human genome
• Sequencing a human genome is no longer a technological challenge;
• Making sense of what it tells us is still much more problematic than anyone ever expected.
Almost 12 years ago, at the 4th Siena meeting, we proposed to annotate in Swiss-Prot all the human proteins
Some stats on human proteins from UniProtKB/Swiss-Prot
• 20’244 reviewed entries (~protein-coding genes);
• 16’000 additional isoforms in about 8’100 entries (40%, but this will probably rise to >60%): 50’000 different protein sequences;
• 65’000 variants; 22’500 are linked to diseases; the rest are SNPs that are SAPs (about 2 per protein). This is the tip of the iceberg;
• 80’000 PTMs (50% of which are experimental). This is the tip of the tip of the iceberg!
Some issues about protein-coding genes
• We completely agree with what was shown earlier in this meeting by the HAVANA group: that there are slightly fewer than 20K protein-coding genes;
• Many weirdos in the genome: bicistronic mRNAs, genes that produce through splicing proteins with no sequence relationship, multiple genes for the same protein, etc.;
• Variation in terms not only of SNPs but also of copy number. And some segregating pseudogenes (olfactory receptors);
• How many have been proven at protein level?
– Using the protein evidence “metric” used at UniProt and neXtProt, we are now at about 70%;
– But if we were hunting everywhere in good-quality MS data, it would rise to about 85%.
The big issue in proteomics is how to hunt for the last 15%
From genome to proteome
~20’000 protein-coding genes
-> alternative splicing of mRNA (2-5 fold increase)
-> ~50’000 to 100’000 transcripts (mRNAs)
-> post-translational modifications (PTMs) of proteins (50-100 fold increase)
-> ~5’000’000 different proteins
Protein complexity
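The fold increases on this slide can be chained as a quick sanity check. The sketch below is only an illustration: the figures are the ones quoted on the slide, while the helper name `scale` and the idea of propagating a low/high range are assumptions made for the example.

```python
# Figures quoted on the slide; the helper below is hypothetical.
PROTEIN_CODING_GENES = 20_000
SPLICING_FOLD = (2, 5)     # alternative splicing of mRNA
PTM_FOLD = (50, 100)       # post-translational modifications of proteins


def scale(count_range, fold_range):
    """Multiply a (low, high) count range by a (low, high) fold range."""
    low, high = count_range
    return low * fold_range[0], high * fold_range[1]


# Genes -> transcripts -> distinct protein forms.
transcripts = scale((PROTEIN_CODING_GENES, PROTEIN_CODING_GENES), SPLICING_FOLD)
proteins = scale(transcripts, PTM_FOLD)

print("transcripts:", transcripts)
print("protein forms:", proteins)
```

The resulting ranges bracket the rounded values shown on the slide: the transcript estimate tops out at 100’000, and the ~5’000’000 protein figure falls inside the computed range.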
The complexity of life and of its molecular actors is fractal
Many human proteins for which we lack functional knowledge
1. Similar to characterized proteins in distant organisms (bacteria, plants, yeast), but no validation in mammals;
2. Presence of domains that help predict a ‘general’ function but not a precise one (examples: hydrolase fold, GPCR);
3. Presence of domains or sequence features that help define some properties (examples: PDZ -> PPI, many TMs -> integral membrane protein);
4. “Orphan”: no similarity to any characterized protein, but can be conserved across a more or less wide taxonomic space.
About 5’000 human proteins are in one of the above four categories
Overview of the CALIPHO wet lab strategy
• In silico selection: sequence analysis, phylogeny, data mining
• Tissue/cell line expression (RT-PCR)
• Cloning of cDNA in the Gateway system
• Yeast two-hybrid
• Subcellular location in HeLa cells (confocal imaging)
• Recombinant protein production in E. coli
• Validation of protein-protein interactions (GST pull-down, co-IP)
• 3D structure by NMR
• Data mining, modelling
• Hypothesis generation
• Functional assays on cell lines (RNAi)
• In vivo validation (animal models, e.g. zebrafish)
Diagram legend: CALIPHO@UniGe, collaborators, CALIPHO@SIB
After 2.5 years…
• A protein involved in ciliogenesis;
• An enzyme involved in a salvage pathway not yet characterized in vertebrates;
• A myristoylated and palmitoylated protein that could be involved in membrane blebbing;
• A mitochondrial protein that may play a role in a mitochondrial import mechanism.
Personal view
• Cons:
– It takes much longer than you expect or want! And magic and luck seem to be the most important factors in successful experiments!
– The low quality-to-cost ratio of many lab reagents (defective antibodies, for example!);
– You can’t freely share preliminary results with everyone because you may (will!) be scooped.
• Pros:
– Fun to see bioinformatics predictions confirmed in the lab;
– Nice collaborations;
– Great lab atmosphere.
• What: a comprehensive resource that complements the SIB/EBI Swiss-Prot human protein annotation efforts. We expect neXtProt to become a central resource for human protein-centric information;
• How:
– by mining, in the most appropriate way and with stringent quality criteria, many high-throughput data resources. We plan to add further protein/protein and protein/small molecule interactions, proteomics data, pathway/network information, variation data (such as SNP frequencies), siRNA screen data, phylogenetic profiling, etc.;
– by integrating experimental results from an extensive network of collaborating laboratories.
UniProtKB/Swiss-Prot Human entries links
• Sequence databases: EMBL, IPI, PIR, RefSeq, UniGene
• Enzyme and pathway databases: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
• Proteomics databases: HPA, PeptideAtlas, PRIDE
• Family and domain databases: Gene3D, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
• 2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, OGP, PMMA-2DPAGE, REPRODUCTION-2DPAGE, SWISS-2DPAGE, World-2DPAGE
• Organism-specific databases: GeneCards, H-InvDB, HGNC, MIM, Orphanet, PharmGKB
• Genome annotation databases: Ensembl, GeneID, KEGG, NMPDR
• 3D structure databases: DisProt, HSSP, PDB, PDBsum, SMR
• Protein family/group databases: GermOnline, MEROPS, PeroxiBase, REBASE, TCDB
• PTM databases: GlycoSuiteDB, PhosphoSite
• Miscellaneous: ArrayExpress, Bgee, BindingDB, CleanEx, dbSNP, DIP, DrugBank, GO, HOGENOM, HOVERGEN, IntAct, LinkHub, NextBio
In Swiss-Prot users always need to navigate toward many external resources so as to consolidate data into knowledge.
In neXtProt the most pertinent data will be integrated so as to enable complex queries.
What is not neXtProt?
• No, neXtProt is not a replacement for UniProtKB/Swiss-Prot;
• No, neXtProt is not universal in coverage: it is intended to provide knowledge pertinent to human proteins;
• No, neXtProt is not a sequence resource: it uses the sequence data curated in Swiss-Prot.
When and what?
• In early 2011 we released a first public version that contained, in terms of data:
– All of Swiss-Prot human data: sequences and annotations;
– Human Protein Atlas (HPA) organ and tissue expression information from IHC (antibodies);
– Metadata on mRNA expression from microarrays and ESTs from Bgee (analyzed from ArrayExpress and UniGene);
– Additional SNPs from dbSNP and Ensembl;
– Chromosomal location and exon mapping from Ensembl;
– Affymetrix and Illumina chip set identifiers.
• In terms of interface, it offers:
– An intuitive query interface;
– Many specialized views (function, medical, expression, etc.);
– The possibility to tag and label proteins.
Bronze, silver and gold
• We have a three-tiered approach to data quality:
– Bronze: noisy or low-quality data that is not imported into the platform;
– Silver: good data, but…
– Gold: data that we believe to be of a swiss-(prot)-level quality.
• By default, searches in neXtProt are carried out on gold data;
• Quality classification is a dynamic process.
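A minimal sketch of how such a three-tiered scheme could behave in code, assuming a simple in-memory list of annotations. The class, function, and accession names here are all hypothetical and do not describe the actual neXtProt implementation; only the rules (bronze is not imported, searches default to gold) come from the slide.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """Ordered tiers, so that comparisons such as >= work naturally."""
    BRONZE = 1
    SILVER = 2
    GOLD = 3


@dataclass(frozen=True)
class Annotation:
    protein: str       # placeholder accession, not a real identifier
    description: str
    quality: Quality


def load(annotations):
    """Import step: bronze data never enters the platform."""
    return [a for a in annotations if a.quality > Quality.BRONZE]


def search(store, protein, min_quality=Quality.GOLD):
    """Search step: gold only by default, silver on explicit request."""
    return [a for a in store
            if a.protein == protein and a.quality >= min_quality]


store = load([
    Annotation("PROT_A", "curated phosphorylation site", Quality.GOLD),
    Annotation("PROT_A", "high-throughput interaction", Quality.SILVER),
    Annotation("PROT_A", "noisy low-confidence hit", Quality.BRONZE),
])

gold_only = search(store, "PROT_A")
with_silver = search(store, "PROT_A", min_quality=Quality.SILVER)
print(len(gold_only), len(with_silver))
```

Because classification is described as a dynamic process, the tier is kept as a field on each record rather than baked into separate stores; re-grading an annotation would then only mean replacing that one record.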
PTMs
We are loading high-quality sets of PTMs, starting with N-glycosylation and phosphorylation
Peptide identifications
• HUPO brain and plasma project peptides from PeptideAtlas;
• Sets linked with PTMs;
• Carapito et al. mitochondrial N-terminome project.
And to be loaded soon:
• Other HUPO data sets;
• Data from various labs (Vienna, Geneva, Roche (Basel), Montpellier, etc.).
New subcellular localization data
• From two projects: DKFZ GFP-cDNA@EMBL and WIS Kahn Dynamic Proteomics db
Data export
• Export of data in both XML and PEFF formats;
• neXtProt is the first resource to offer support for the PSI PEFF format;
• This enriched FASTA format allows search engines and other tools to easily and consistently access data essential to the success of HPP, namely sequence variations and PTMs.
Download by FTP
• At ftp.nextprot.org;
• Downloads are available in XML or PEFF;
• These files are also available per chromosome, as are ‘report’ files.
What’s next in terms of tools
• A tool for the analysis of lists of proteins so as to explore their enrichment in various types of annotations, including Gene Ontology (GO) terms.
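The slide does not say which statistic the planned tool will use. One common choice for this kind of analysis is a one-sided hypergeometric test, sketched below under that assumption. The function names, the toy protein identifiers, and the annotation mapping are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` proteins are sampled without replacement
    from `population`, of which `annotated` carry the term."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    hits = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
               for i in range(k, upper + 1))
    return hits / total


def enrichment(protein_list, background, term_to_proteins):
    """Return {term: p-value} for a protein list against a background set."""
    background = set(background)
    selected = set(protein_list) & background
    p_values = {}
    for term, members in term_to_proteins.items():
        annotated = set(members) & background
        overlap = len(selected & annotated)
        p_values[term] = hypergeom_tail(
            overlap, len(background), len(annotated), len(selected))
    return p_values


# Toy data: ten placeholder proteins, five of them annotated with one term.
background = [f"P{i}" for i in range(1, 11)]
annotations = {"cilium assembly": {"P1", "P2", "P3", "P4", "P5"}}
p_values = enrichment(["P1", "P2", "P3", "P4", "P5"], background, annotations)
print(p_values)
```

Exact integer binomials from `math.comb` keep the sketch dependency-free; a production tool would more likely rely on a statistics library and correct the p-values for the number of terms tested.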
Programmatic access
• We will build an API to allow third-party software tools to make use of the data in neXtProt;
• Together with BIONEXT, we have obtained a grant to develop this API and integrate a version of their 3D structure visualisation tool in neXtProt.
A note about variants
• There are now over 420’000 variants loaded in neXtProt;
• 65’000 come from Swiss-Prot; the others have been loaded from dbSNP through Ensembl;
• We will also load the Cosmic variants as well as variants from other sources.
We also want to do many other things as quickly as possible but…
The road map: principles
• Our vision is to gradually build up neXtProt, not only by adding new data resources but:
– By integrating state-of-the-art data mining tools;
– By integrating some forms of “social networking” functionalities allowing researchers to share ideas and data;
– By enabling the modeling of hypotheses inside the framework of the platform.
• To work closely with collaborators and users to define how the data and tools that we will incorporate into neXtProt will be useful for their research.
A new resource for cell lines
• There are three ontologies catering for cell lines (MCCL, CLO, Brenda);
• A large number of on-line catalogs: ATCC, CBA, CCRID, Coriell, DSMZ, ECACC, ICLC, IFO, IZSLER, JCRB, RCB, Riken;
• There are information resources: CABRI, CCLE, COPE, HyperCLDB, Lonza;
• Databases storing cell lines as “samples”: Cosmic;
• Topical reviews on ‘categories’ of cell lines;
• Various lists of contaminated cell lines…
But there was so far no single resource pooling together all this information in an attempt to create a cell line thesaurus.
• Not an ontology, but a thesaurus;
• Links to all the ontologies, catalogs, resources, publications, web sites, etc. (over 20’000 Xref);
• Current version: 8766 cell lines. The next version (May) will have over 10’000 lines and 5’000 synonyms;
• Scope: vertebrates (80% human, 15% mouse and rat; the remainder is spread over about 100 species);
• Currently available in a Swiss-Prot-like text-based format at: ftp://ftp.nextprot.org/
• But it will soon also be available in OBO format as it has a number of relationships (derives_from, etc.);
• Currently: no links to tissues and diseases, but these will be added later.
The ISB
• A young society but already very active;
• Pros:
– Over 310 active members from 15 countries;
– The international meeting (now yearly);
– Good links to journals such as Database and NAR;
– Common projects such as BioDBCore.
• Cons:
– Not enough grass-roots involvement of the members;
– Not yet enough awareness of the existence of the society among would-be members in many countries (Eastern Europe, South America, etc.) but also closer to ‘home’ (in the US).
Be more proactive!
Biocuration is an expanding field
• Good news:
– Increasing number of biocurators in academia and industry;
– More and more knowledge resources incorporate some amount of manual biocuration.
• Bad news:
– The usual problem of long-term funding and sustainability of key resources;
– A lot of re-inventing the wheel as annotation SOPs are generally not easily available.
The data flood
• Yes it exists but…
• A big proportion of the data that accumulates today is not going to be useful in a few years;
• For example: if we had a clean full-length genome sequence of “all” representative species on earth, this would be only 10 petabytes of information (10 million species with 1 billion bp each);
• The genome of a human being stored as a variant file is only 60 MB (compressed). So storing the variation information for 10 billion individuals takes slightly less than 1 exabyte: not a big challenge in terms of technology and price in 2020;
• In the meantime we are still encapsulating our most important knowledge using a 16th century technology: free
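The storage estimates on the data-flood slide can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one byte per base pair (which is what makes the 10 petabyte figure come out exactly), reads the per-genome variant file size as 60 megabytes, and uses decimal units; the variable names are illustrative.

```python
PETABYTE = 10**15
EXABYTE = 10**18

# "All" representative species: 10 million genomes of 1 billion bp each.
SPECIES = 10_000_000
BP_PER_GENOME = 1_000_000_000
BYTES_PER_BP = 1                      # assumption: uncompressed, 1 byte per base

all_species_bytes = SPECIES * BP_PER_GENOME * BYTES_PER_BP
all_species_pb = all_species_bytes / PETABYTE

# Human variation: one compressed variant file per individual.
VARIANT_FILE_BYTES = 60 * 10**6       # assumption: 60 megabytes
INDIVIDUALS = 10 * 10**9

human_variation_bytes = VARIANT_FILE_BYTES * INDIVIDUALS
human_variation_eb = human_variation_bytes / EXABYTE

print(f"all species: {all_species_pb:.0f} PB")
print(f"human variation: {human_variation_eb:.1f} EB")
```

Under these assumptions the species estimate is 10 PB and the human variation estimate is 0.6 EB, consistent with the orders of magnitude quoted on the slide. A 2-bit encoding of the four bases would cut the first figure by a factor of four.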
CALIPHO@UniGe_and_SIB
• neXtProt content:
– Coordinator: Pascale Gaudet
– Biocurators: Guislaine Argoud-Puy, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Nevila Nouspikel
– QA: Monique Zahn
• neXtProt software developers:
– Olivier Evalet, Alain Gateau, Anne Gleizes, Mario Pereira, Catherine Zwahlen (and for two years: Alexandre Masselot)
• Laboratory research:
– Franck Bontems, Marjorie Desmurs, Camille Mary, Rachel Porcelli, Irene Rossito, Lisa Salleron, Fabiana Tirone
• Directed by:
– Amos Bairoch and Lydie Lane
And we have a position open for a Java developer (will soon be announced on the ISB web site)