Getting genomics and proteomics data to work together - Jason Wong

Genomics and proteomics are closely related fields of research. An understanding of one is generally required for the other, yet in many ways the methods used to study the two could not be more different. With the emergence of massively parallel sequencing, vast quantities of genomics and transcriptomics data are being generated. At the same time, improvements in mass spectrometry technologies are enabling proteins to be identified with greater specificity and sensitivity. This provides new opportunities to investigate ways to integrate genomics and proteomics data and to understand how the two can complement each other to advance biological knowledge. Using HeLa cells as a model system, we have comprehensively examined the gene models derived from genomics and transcriptomics data and integrated these with proteomics and phosphoproteomics datasets. Reanalysis of proteomics data using HeLa-specific gene models enables significant increases in the number of peptides/proteins identified, providing new insights into both the genome and proteome of HeLa cells. Technical challenges and methods required for integrating genomics and proteomics data will also be discussed. In summary, given that massively parallel sequencing data are now available for many popular cell lines in public data repositories, our study provides further support for the need for, and benefit of, integrative data analysis in both genome and proteome studies.

Published in: Health & Medicine, Technology

  1. Getting genomics and proteomics data to work together. Dr Jason Wong, Prince of Wales Clinical School
  2. State of the art in proteomics. Proteomics can now be used to identify and quantify tens of thousands of proteins in a single experiment.
     • Nagaraj et al, Mol Sys Biol 2011: HeLa cells, 10,255 proteins identified
     • Zhou et al, Nat. Comm. 2013: mESC, 11,352 proteins identified
     • Mertins et al, Nat. Met. 2013: Jurkat cells, 7,897 proteins
  3. Challenges of proteomics
     • Experimental perspective: obtaining sufficient sample; sample preparation; dynamic range
     • Computational perspective: risk of false-positive identification; general methods only identify known proteins
     • Breakdown of acquired spectra (Wong et al, BMC Bioinf 2007): non-peptide/low-quality spectra (~50%); annotated spectra (~30%); high-quality, potentially annotatable spectra (~20%)
  4. Genomics and transcriptomics. Analysis of DNA/RNA does not have many of the limitations of proteomics, especially with the emergence of next-generation sequencing (NGS).
     • Sample quantity is less of an issue when analysing DNA/RNA.
     • Very large dynamic range with NGS.
     • Relatively simple sequence-based data analysis.
     Next-generation sequencing has allowed the discovery of:
     1. Single nucleotide variants/indels (Exome-seq/RNA-seq)
     2. Novel splice variants (RNA-seq)
     3. Novel proteins (ribosome profiling)
     However, to understand the functional importance of coding genes, it is still essential to study them at the protein level.
  5. Single nucleotide variants: Jurkat cells. Datasets (experiment: details; reference):
     • Exome-seq: ~150 M, 100 bp PE reads; Broad Institute, CCLE
     • RNA-seq: ~100 M, 100 bp PE reads; Sheynkman et al MCP (2013)
     • Proteomics, deep: ~0.5 M spectra; Sheynkman et al MCP (2013)
     • Proteomics, ultra-deep: ~2.5 M spectra; Mertins et al Nat Meth (2013)
     • Proteomics, ultra-deep PTM: pSTY ~0.85 M spectra, (ac)K ~0.35 M spectra, (ubi)K ~0.36 M spectra; Mertins et al Nat Meth (2013)
  6. Searching for peptides with SNVs
     • Exome/RNA-seq branch: BAM → GATK/samtools → VCF → ANNOVAR (RefSeq) → annotated variants → Python scripts → variant peptides
     • Proteomics branch: mass spectra → search using MaxQuant → annotated mass spectra
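The "annotated variants → variant peptides" step above can be illustrated with a minimal sketch. This is not the Python script used in the talk: the function names, the toy protein sequence, the `p.G5S` change and the simplified trypsin rule (cleave after K/R unless followed by P, no missed cleavages) are all assumptions made for illustration.

```python
import re

def tryptic_digest(protein):
    """In-silico trypsin digest: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides

def apply_substitution(protein, change):
    """Apply a single amino-acid change written like 'p.G5S' (1-based position)."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unsupported change: {change}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if protein[pos - 1] != ref:
        raise ValueError(f"reference residue mismatch at position {pos}")
    return protein[:pos - 1] + alt + protein[pos:]

def variant_peptides(protein, change, min_len=6):
    """Peptides present in the variant digest but absent from the reference digest."""
    reference = set(tryptic_digest(protein))
    mutated = tryptic_digest(apply_substitution(protein, change))
    return [p for p in mutated if p not in reference and len(p) >= min_len]

# Hypothetical toy protein, not a real RefSeq entry.
PROTEIN = "MKAAGLSDEKLLPYRTTWEK"
print(variant_peptides(PROTEIN, "p.G5S"))
```

Peptides produced this way would be appended to the reference protein database before the spectra are searched, so that variant-containing spectra have a sequence to match against.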
  7. Variants: overlap between Exome-seq and RNA-seq. Non-synonymous (n.s.) variants:
     • Exome-seq only: 8232
     • Exome-seq and RNA-seq: 4584
     • RNA-seq only: 1975
     Almost 70% of RNA-seq n.s. variants overlap with Exome-seq.
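The "almost 70%" figure can be checked with a few lines of arithmetic. The sketch assumes the three counts on this slide are the exome-only, shared and RNA-seq-only regions of the Venn diagram; that reading is consistent with the totals quoted on the following slide (12816 Exome-seq variants and 6559 RNA-seq variants).

```python
# Assumed reading of the Venn diagram regions (see lead-in).
exome_only, shared, rnaseq_only = 8232, 4584, 1975

exome_total = exome_only + shared    # all Exome-seq n.s. variants
rnaseq_total = shared + rnaseq_only  # all RNA-seq n.s. variants

# Fraction of each call set that is also found by the other technology.
rnaseq_confirmed = shared / rnaseq_total
exome_confirmed = shared / exome_total

print(f"Exome-seq total: {exome_total}, RNA-seq total: {rnaseq_total}")
print(f"RNA-seq variants also in Exome-seq: {rnaseq_confirmed:.1%}")
print(f"Exome-seq variants also in RNA-seq: {exome_confirmed:.1%}")
```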
  8. Variants: overlap with proteomics data. RNA-seq based variants validated by mass spectrometry:
     • Mertins dataset: 987 variant peptides of 156,606 total peptides
     • Sheynkman dataset: 448 variant peptides of 75,878 total peptides
     Venn diagrams of variant peptides identified in the two proteomics datasets:
     • RNA-seq (total variants 6559): Mertins only 638, both 349, Sheynkman only 99
     • Exome-seq (total variants 12816): region counts 525, 290 and 81
     This suggests that RNA-seq may be better suited for finding variants in proteomics data. However, it may simply be due to data-quality issues.
  9. Validation of peptide identifications (Mertins dataset)
     Zygosity | Variant peptides | Reference peptides
     Heterozygous | 673 | 465
     Homozygous | 314 | 4
     [Figure: variant abundance (log10) against reference abundance (log10), r² = 0.19]
     Chr | Pos | Ref | Alt | Zygosity | Qual | Depth | Depth Alt | Func.refGene | Gene.refGene
     chr10 | 51363659 | A | G | 1 | 222 | 222 | 200 | exonic | PARG
     chr10 | 71906150 | T | C | 1 | 44.8 | 2 | 2 | exonic | TYSND1
     chr11 | 67414492 | C | T | 1 | 43.8 | 2 | 2 | exonic | ACY3
     chr19 | 3492265 | G | A | 1 | 59 | 3 | 3 | exonic | DOHH
  10. Validation of peptide identifications. [Figures: percentage of variants per read-depth bin, for all variants and for variants identified by MS; MaxQuant score compared across peptide groups, with differences marked n.s.]
  11. Variants in application to PTMs. Variant peptides among all modified peptides:
     • Phosphorylation, STY(p): 357 of 64,067
     • Acetylation, K(Ac): 2 of 5,805
     • Ubiquitination, K(GG): 172 of 38,454
     How does the variant affect phosphorylation?
     (1) Variant residue not affecting the phosphorylation site: 95% (339), e.g. EILpSPQ(W/C)Y
     (2) Variant residue is the phosphorylation site: 3.4% (12), e.g. EIL(A/pS)PQWY
     (3) Variant residue may influence phosphorylation: 1.6% (6), e.g. EILpS(G/P)QWY
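The three categories above can be expressed as a small classification rule. This is a sketch under an explicit assumption: "may influence phosphorylation" is modelled here as the variant lying within `window` residues of the phosphorylated residue. The slide does not state the criterion actually used, and both `classify_variant` and the default window of 1 are hypothetical.

```python
def classify_variant(site_index, variant_index, window=1):
    """Classify a variant residue relative to a phosphorylated residue.

    Both indices are 0-based positions within the same peptide.
    `window` (assumed, not from the slides) sets how close a variant must be
    to count as potentially influencing the site.
    """
    if variant_index == site_index:
        return "site"        # (2) the variant residue is the phosphorylation site
    if abs(variant_index - site_index) <= window:
        return "nearby"      # (3) the variant may influence phosphorylation
    return "unaffected"      # (1) the variant does not affect the site

# The three example peptides from the slide, with pS at index 3 of EILSPQWY.
examples = {
    "EILpSPQ(W/C)Y": (3, 6),
    "EIL(A/pS)PQWY": (3, 3),
    "EILpS(G/P)QWY": (3, 4),
}
for peptide, (site, variant) in examples.items():
    print(peptide, "->", classify_variant(site, variant))
```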
  12. Phosphorylation sites directly affected by variants
     Gene | Refseq | Variant | Peptide | SNP ID | SIFT score | Polyphen score
     Creation of phospho-site (X → S/T):
     GBF1 | NM_001199378 | p.G1690S | GGSPSALWEITWER | rs11191274 | 0.75 | 0.703
     LINS | NM_001040616 | p.R680S | EFSLEPPSSPLVLK | rs8451 | 0.36 | 0.023
     TADA2A | NM_001166105 | p.P6S | LGSFSNDPSDKPPCR | rs7211875 | 1 | 0
     TCF3 | NM_001136139 | p.A8S | MSPVGTDKELSDLLDFSMMFPLPVTNGK | rs147133056 | 0.05 | 0.997
     USE1 | NM_018467 | p.L154S | TGVAGSQPVSEKQSAAELDLVLQR | rs414528 | 0.82 | 0
     ATP5SL | NM_001167867 | p.N40S | LGAAVAPEGSQKK | rs2231940 | 0.82 | 0.018
     TANC1 | NM_001145909 | p.N250S | SGSSLEWNKDGSLR | rs12466551 | 1 | 0
     SF3B1 | NM_001005526 | p.A86T | KPGYHTPVALLNDIPQSTEQYDPFAEHRPPK | NA | 0.01 | 0.955
     DDX27 | NM_017895 | p.G766S | QYRASPSFEER | rs1130146 | 0.37 | 0.983
     FAM114A2 | NM_018691 | p.G122S | AETSLGIPSPSEISTEVK | rs2578377 | 1 | 0
     SPDL1 | NM_017785 | p.L586S | SHPILYVSSK | rs3777084 | 0.89 | 0
     GPANK1 | NM_033177 | p.A78S | IMKSPAAEAVAEGASGR | NA | 0.72 | 0.005
     Creation of MAPK motif (S/T-X → S/T-P):
     PARG | NM_003631 | p.L138P | LENVSQLSLDKSPTEK | rs4412715 | NA | NA
     KIAA0586 | NM_001244193 | p.L703P | EASPPPVQTWIK | rs1748986 | 0.22 | 0.001
     DMXL2 | NM_001174116 | p.S1288P | FGDTEADSPNAEEAAMQDHSTFK | rs12102203 | 0.21 | 0
     GGA2 | NM_015044 | p.A424P | NLLDLLSPQPAPCPLNYVSQK | rs1135045 | 0.48 | 0
     PFAS | NM_012393 | p.L621P | NGQGDAPPTPPPTPVDLELEWVLGK | rs11078738 | 1 | 0
     ZNF235 | NM_004234 | p.H296P | SPACSTPEKDTSYSSGIPVQQSVR | rs2125579 | 0.02 | 0.001
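The two groups on this slide can be recovered from the variant notation alone, which the sketch below illustrates by parsing the 18 listed protein changes and grouping them by the residue gained. The helper names are hypothetical. Gaining a proline only creates an S/T-P motif when the preceding residue is S or T, which requires the protein sequence and is not checked here.

```python
import re
from collections import Counter

# Protein changes as listed on the slide.
VARIANTS = [
    "p.G1690S", "p.R680S", "p.P6S", "p.A8S", "p.L154S", "p.N40S",
    "p.N250S", "p.A86T", "p.G766S", "p.G122S", "p.L586S", "p.A78S",
    "p.L138P", "p.L703P", "p.S1288P", "p.A424P", "p.L621P", "p.H296P",
]

CHANGE = re.compile(r"p\.([A-Z])(\d+)([A-Z])")

def parse_change(change):
    """Split a change such as 'p.A86T' into (reference, position, alternate)."""
    match = CHANGE.fullmatch(change)
    if match is None:
        raise ValueError(f"unsupported change: {change}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def group_by_gained_residue(changes):
    """Count changes that gain a phospho-acceptor (S/T) or a proline."""
    groups = Counter()
    for change in changes:
        _, _, alt = parse_change(change)
        if alt in "ST":
            groups["gains S/T"] += 1   # candidate new phosphorylation site
        elif alt == "P":
            groups["gains P"] += 1     # candidate proline-directed motif
        else:
            groups["other"] += 1
    return groups

print(group_by_gained_residue(VARIANTS))
```

The resulting counts (12 and 6) match categories (2) and (3) on the previous slide.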
  13. Splicing factor 3B subunit 1 (SF3B1)
     • Part of the RNA splicing machinery
     • Frequently mutated in myelodysplasia (~20%) and other leukaemias
     • A86T is located in exon 3
     • A86T has been identified previously in one lung cancer sample from TCGA.
     • SF3B1 is a phosphoprotein, and phosphorylation of SF3B1 is known to be important for the assembly of the RNA splicing machinery.
  14. Using mass spectrometry to discover new proteins
     Prediction of alternative ORFs from RNA-seq:
     • 86 alternative ORFs
     • 57% non-AUG translation initiation
     Computational prediction of alternative ORFs:
     • Only AUG translation initiation
     • 1,259 alternative ORFs!
     [Figure: posterior error probability (PEP) compared across ORF groups (labels include HeLa, serum and <10 kDa); comparisons marked ****]
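To make "computational prediction of alternative ORFs" concrete, the following is a minimal three-frame ORF scan over a transcript. It is a generic sketch, not the prediction method behind either number on this slide; `find_orfs`, the `min_codons` cutoff and the toy sequences are assumptions. The `start_codons` parameter shows how restricting initiation to AUG (ATG in cDNA) or allowing near-cognate codons changes what is reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(transcript, start_codons=("ATG",), min_codons=10):
    """Scan the three forward frames of a transcript for ORFs.

    Returns (start, end, frame) tuples with 0-based start and exclusive end,
    where the end includes the stop codon. For each stop codon, only the first
    in-frame start codon upstream of it is reported.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        open_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if open_start is None and codon in start_codons:
                open_start = i
            elif open_start is not None and codon in STOP_CODONS:
                n_codons = (i - open_start) // 3  # stop codon not counted
                if n_codons >= min_codons:
                    orfs.append((open_start, i + 3, frame))
                open_start = None
    return orfs

# Toy sequences (hypothetical): one AUG-initiated ORF, one CUG-initiated ORF.
aug_example = "CCATGGCTGCTGCTTAAG"
cug_example = "CTGGCTGCTGCTTAA"
print(find_orfs(aug_example, min_codons=3))
print(find_orfs(cug_example, min_codons=3))
print(find_orfs(cug_example, start_codons=("ATG", "CTG"), min_codons=3))
```

Translating each reported ORF and adding the products to the search database is what allows mass spectrometry to confirm or reject the predictions.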
  15. Ribosome profiling (Ingolia et al, Cell 2011)
     • Sequences only the mRNA protected by ribosomes
     • Results in identification of mRNA that is being translated
     • Use of translation inhibitors enables discovery of new ORFs
  16. Application to HEK293T cells (Lee et al, PNAS 2012)
     • Reported 12,814 alternative ORFs in RefSeq genes
     • Using their annotation, we constructed a database of proteins arising from these alternative ORFs.
     • Searched against publicly available HEK293T datasets: Geiger et al, MCP 2012, ~0.5 million spectra
  17. Identified novel proteins (11 novel ORFs in total)
     RefSeq Accession | Gene Symbol | Relative to rTIS | Annotation | Frame | ORF length | Codon | Peptide count
     NM_019008 | SMCR7L | -311 | 5'UTR | 1 | 213 | ATG | 3
     NM_080670 | SLC35A4 | -719 | 5'UTR | 1 | 312 | ATG | 2
     NM_001142726 | C1orf122 | -609 | 5'UTR | 0 | 753 | TTG | 2
     NM_004860 | FXR2 | -219 | 5'UTR | 0 | 2241 | GTG | 2
     Regions shown: NM_019008_SMCR7L_188_5'UTR (chr22:39,900,236-39,900,445); NM_080670_SLC35A4_10_5'UTR (chr5:139,944,429-139,946,345)
     [Figure: percentage of regions against conservation (0.0 to 1.0), for 5'UTR and coding regions]
  18. Conclusion and next steps
     Variants:
     • Analyse indels and more datasets.
     Novel proteins:
     • Develop methods to analyse ribosome profiles in unannotated genomic regions.
     • Generate a HEK293T peptidomics (< 10 kDa) dataset.
     In terms of bioinformatics:
     • Automate integration of transcriptomics and proteomics data.
     • Develop methods to visualise identified peptides on genome browsers.
     • Develop new MS-based search methods for directly finding variant peptides.
  19. Acknowledgements
     Bioinformatics and Integrative genomics team:
     • Dr Ranjeeta Menon
     • Dr Dominik Beck
     • John Ng
     • Jackie Huang
     • Kate Guan
     • Felix Ma
     • Dilmi Perera
     • Diego Chacon
     Carnegie Institution for Science, Baltimore, USA:
     • Dr Nicolas Ingolia
     Funding:
