Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

"Computing for the Analysis of Genomic Data at CRS4" (Chris Jones), presentation at the CRS4 Research Center. CRS4 Staff Meeting, 24-03-2010 (Pula, Sardinia, Italy)

Published in: Health & Medicine

  1. Computing for the Analysis of Genomic Data at CRS4. Chris Jones, 24th March 2010 (slide footer: Thursday 25 March 2010)
  8. Who is Chris Jones?
      • 10 years of particle physics research at Oxford and at CERN in Geneva
      • Strong interest in the use of computers to do things, especially science, BETTER
      • The ’70s brought digital detectors and massive waves of new data to particle physics, causing exciting major changes in the use of, and attitude towards, computers
      • 20 years of innovating, building, developing and running services in the CERN Computer Centre Facility
  14. Wellcome Trust Genome Campus
      • Escaped on sabbatical to the European Bioinformatics Institute (EBI)
      • Strong links to the Sanger Institute
      • And to Roche: the Roche Genetics IT Plan
      • Founded the PRISM Forum
  15. Why Sequence Genomes?
      • I hope Francesco has explained that very well
      • Genomic sequence is the most fundamental information, the starting point, when you look at how living organisms work…
      • And studies of “genotype” versus “phenotype” can bring us an understanding of the origins of disease which has been completely out of reach until now
      • The technology is just becoming available…
  16. DNA sequence and genes look like…
      cacaattacttccacaaatgcagtt
      gaagcttctactcttcttgcatagg
      taacctgagtcggagcagttttcct
      cgtggcttcatctttggtgctggat
      cttcagcataccaatttgaaggtgc
      agtaaacgaaggcggtagaggacca
      agtatttgggataccttcacccata
      aatatccagaaaaaataagggatgg
      aagcaatgcagacatcacggttgc
  23. The Human Genome
      • The nucleotide bases are: a (adenine), c (cytosine), g (guanine), t (thymine)
      • It took 15 years to produce the first human genome sequence
      • Which was released between 2003 and 2005
      • There are 3 × 10^9 bases, or 3 Gigabases, in the human genome
      • Pine trees have ~10 times more bases! Why?
      • Do not confuse Gb (gigabits), GB (gigabytes) and Gbases (also written Gb)!
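The unit warning on this slide can be made concrete with a small sketch. It assumes two illustrative encodings that the slides do not specify: a packed form using 2 bits per base (four symbols need two bits) and a plain text form using one byte per base.

```python
# Rough storage needed for one human genome of 3 Gbases.
GENOME_BASES = 3 * 10**9       # 3 Gbases, from the slide

BITS_PER_BASE_PACKED = 2       # assumption: a, c, g, t packed into 2 bits each
BYTES_PER_BASE_TEXT = 1        # assumption: one ASCII character per base

packed_bits = GENOME_BASES * BITS_PER_BASE_PACKED
packed_gigabits = packed_bits / 10**9         # Gb, gigabits
packed_gigabytes = packed_bits / 8 / 10**9    # GB, gigabytes
text_gigabytes = GENOME_BASES * BYTES_PER_BASE_TEXT / 10**9

print(f"Genome size:      {GENOME_BASES / 10**9:.0f} Gbases")
print(f"Packed (2 bits):  {packed_gigabits:.0f} Gb = {packed_gigabytes:.2f} GB")
print(f"Text (1 byte):    {text_gigabytes:.0f} GB")
```

The same genome is therefore 3 Gbases, 6 gigabits or 0.75 gigabytes when packed, and 3 gigabytes as text, which is why the three "Gb" spellings must be kept apart.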
  25. Genome Analyzer IIx
      • In Edificio 3 (Building 3)
      • Two GAIIx machines
      • Each of which:
          − 40 Gbases / run
          − Paired-end reads
          − 4 Gbases / day
      • but which are complex and forefront technology...
  26. Genome Analyzer IIx Preparation Workflow (diagram labels: Sample Prep, Pipeline, Analysis)
  27. Genome Analyzer IIx FlowCell
      • 8 lanes
      • 120 tiles (2 columns of 60 tiles)
      • 4 pictures per tile (A, T, G, C fluorescence)
      • ~220k clusters on each tile
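The flow cell geometry above fixes how many pictures are taken per sequencing cycle and roughly how many clusters are imaged. The sketch below multiplies the slide's figures out; it assumes the 120 tiles are per lane, which matches the "120 tiles * 8 lanes" product used in the data-volume slide.

```python
# Flow cell geometry, using the figures from the slide.
LANES = 8
TILES_PER_LANE = 120          # assumption: the 120 tiles are per lane
PICTURES_PER_TILE = 4         # one picture per base (A, T, G, C)
CLUSTERS_PER_TILE = 220_000   # "~220k clusters"

tiles = LANES * TILES_PER_LANE
pictures_per_cycle = tiles * PICTURES_PER_TILE
clusters_per_flowcell = tiles * CLUSTERS_PER_TILE

print(f"Tiles per flow cell:    {tiles}")
print(f"Pictures per cycle:     {pictures_per_cycle}")
print(f"Clusters per flow cell: ~{clusters_per_flowcell:,}")
```

Under these assumptions each cycle produces 3,840 pictures covering roughly 211 million clusters.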
  32. How much data per run?
      • 7.3 MBytes of image data per tile * 120 tiles * 8 lanes = 7 000 MBytes = 7 GigaBytes
      • * 4 pictures per tile (one per base) * read length (say 100 cycles) = 2 800 GBytes or 2.8 TeraBytes (TB)
      • * 2 for the paired end = 5.6 TBytes
      • A run of ~1 week on both machines results in 11.2 TeraBytes of image data
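The chain of multiplications on this slide can be reproduced step by step. This is a minimal sketch using the slide's own figures and decimal units (1 GB = 1000 MB); the variable names are only illustrative.

```python
# Image data volume per run, following the slide's arithmetic.
MB_PER_TILE_PICTURE = 7.3   # MBytes of image data per tile picture
TILES_PER_LANE = 120
LANES = 8
PICTURES_PER_CYCLE = 4      # one picture per base (A, T, G, C)
CYCLES = 100                # read length, "say 100"
PAIRED_END = 2              # both ends of each fragment are read
MACHINES = 2                # two GAIIx machines

gb_one_picture_set = MB_PER_TILE_PICTURE * TILES_PER_LANE * LANES / 1000
tb_single_read = gb_one_picture_set * PICTURES_PER_CYCLE * CYCLES / 1000
tb_paired_end = tb_single_read * PAIRED_END
tb_both_machines = tb_paired_end * MACHINES

print(f"One picture of every tile: {gb_one_picture_set:.1f} GB")
print(f"Single-read run:           {tb_single_read:.1f} TB")
print(f"Paired-end run:            {tb_paired_end:.1f} TB")
print(f"Both machines:             {tb_both_machines:.1f} TB")
```

The unrounded product is slightly above the slide's rounded values (7.008 GB rather than 7 GB), but each step rounds to the figures quoted.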
  33. Keeping the raw data?
      • If we run for ~40 weeks a year we have nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or 1 000 000 000 000 000 Bytes)
      • But if we throw the images away there is no chance to recover more sequence data from the images when a better (promised) algorithm comes along…
      • So biology now faces the problem the physicists faced 35 years ago
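The "nearly 0.5 PetaBytes" figure follows from the weekly volume quoted on the previous slide. A short sketch, assuming one run per week on both machines and decimal units (1 PB = 1000 TB):

```python
# Yearly raw image volume if every run's images are kept.
TB_PER_WEEK_BOTH_MACHINES = 11.2   # from the previous slide
WEEKS_PER_YEAR = 40                # "~40 weeks a year"
TB_PER_PB = 1000                   # decimal units: 1 PB = 10^15 bytes

tb_per_year = TB_PER_WEEK_BOTH_MACHINES * WEEKS_PER_YEAR
pb_per_year = tb_per_year / TB_PER_PB

print(f"{tb_per_year:.0f} TB per year = {pb_per_year:.2f} PB per year")
```

That gives about 448 TB, or 0.45 PB, per year.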
  34. Genome Analyzer IIx Cluster generation
      • Attach single molecules to surface
      • Amplify to form clusters
      • 10^3 molecules / µm
      • 2.2·10^5 molecules/tile
  35. Genome Analyzer IIx Base Calling
      • The identity of each base of each cluster is read off from sequential images (cycle by cycle)
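The cycle-by-cycle idea can be illustrated with a toy base caller. This is a deliberately simplified sketch, not the Illumina pipeline's algorithm: it assumes each cycle yields four intensity values for one cluster (in A, C, G, T order), picks the brightest channel, and emits a blank call "N" when the two brightest channels are too close. The function names, the `min_margin` threshold and the intensity values are all hypothetical.

```python
BASES = "ACGT"  # assumed channel order for this sketch

def call_base(channels, min_margin=0.2):
    """Call one base from four channel intensities, or 'N' if ambiguous."""
    ranked = sorted(zip(channels, BASES), reverse=True)
    (best, base), (second, _) = ranked[0], ranked[1]
    return base if best - second >= min_margin else "N"

def call_read(cycles, min_margin=0.2):
    """Build a read for one cluster from its per-cycle intensities."""
    return "".join(call_base(c, min_margin) for c in cycles)

# Made-up intensities for one cluster over five cycles.
cluster = [
    (0.9, 0.1, 0.0, 0.1),    # clearly A
    (0.1, 0.8, 0.1, 0.0),    # clearly C
    (0.0, 0.1, 0.2, 0.9),    # clearly T
    (0.1, 0.0, 0.7, 0.2),    # clearly G
    (0.5, 0.45, 0.0, 0.0),   # A and C too close: blank call
]

print(call_read(cluster))  # prints ACTGN
```

A production base caller has to handle effects this sketch ignores, so the code only shows the shape of the computation: one decision per cluster per cycle.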
  36. Illumina Pipeline
      ACTGCTATCTT
      TCGATTCGTAC
      TGCTAGGCACC
      ATCGCATTTCA
      GGACGTCCTGC
      TAGGCACCATC
      GCATCTCCATC
  37. Experiment Timeline (timing for a 115-cycle experiment on the GA IIx)
      • Day 1: GA IIx start
      • Day 10: Illumina Pipeline
      • Day 13: BWA and Yun Li workflow
      • Day 15: Quality-check tools
  38. How much computing?
      • A software pipeline has been implemented at CRS4 to perform such operations automatically after a sequencing run ends
      • 40 Gbases per run
      • 370,000,000 sequences
      • 4 samples per flowcell
      • 7,000,000 megabytes of raw data produced per run
      • 5 days for processing sequence data on the cluster
      • A huge load for the computer centre
  39. How much computing?
  44. Quality Control
      • We realised we needed an audit by external experts of how well we were doing (or how badly)
      • We asked experts from the Sanger Institute and from Cancer Research, Cambridge, UK
      • We developed a quality check process:
          − Qualitative and quantitative evaluation of Illumina summary file parameters
          − Evaluation of sequence quality (avg. number of “blank” base calls)
          − Evaluation of coverage / holes
          − Evaluation of known/all SNPs found ratio
      • This has been very successful
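Three of the quality checks listed above reduce to simple counts. The sketch below shows one possible form of each on toy data; it is not the CRS4 or Sanger QC code. It assumes that "blank" base calls appear as "N" in the reads, that coverage is available as a per-position depth list, and that SNPs are identified by position. All function names and data are hypothetical.

```python
def average_blank_calls(reads):
    """Mean number of 'N' (blank) base calls per read."""
    return sum(read.upper().count("N") for read in reads) / len(reads)

def coverage_holes(depth):
    """Half-open (start, end) intervals where the per-position depth is zero."""
    holes, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos                  # a hole opens here
        elif d != 0 and start is not None:
            holes.append((start, pos))   # the hole closes before pos
            start = None
    if start is not None:                # hole running to the end
        holes.append((start, len(depth)))
    return holes

def known_snp_ratio(found, known):
    """Fraction of the SNP positions found that are already known."""
    found = set(found)
    return len(found & set(known)) / len(found)

# Toy inputs, invented for illustration.
reads = ["ACGTN", "NNGTA", "ACGTA"]
depth = [3, 2, 0, 0, 1, 0]
found_snps = [10, 20, 30, 40]
known_snps = [10, 20, 30, 99]

print(average_blank_calls(reads))
print(coverage_holes(depth))
print(known_snp_ratio(found_snps, known_snps))
```

On this toy data the reads average one blank call each, the coverage has holes at positions 2 to 3 and at position 5, and three of the four SNPs found are already known.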
  45. Quality Check: Weekly Team Meeting
      • Qualitative and quantitative evaluation of Illumina summary file parameters:
          − Based on Sanger QC protocol
          − Quantitative examination of run results
          − Qualitative inspection of plots
  46. Summary of results
      • In October 2008 we foresaw 6 Gbases per run per machine
      • We started at the end of February 2009
      • We started a Quality Control initiative in Sept. 2009
      • We have continuously improved the number of bases per run:
          − Upgrades of machines
          − Preparation of samples (reagents, PCR)
          − Increasing number of cycles
          − New algorithms for image processing and base-calling; better alignment software
          − Quality control
  48. Activity summary: statistics
      • 67 samples sequenced and aligned
      • 6 samples currently running on the GAs
      • Average coverage of samples 2.98X
      • ~800 Gbases of raw data
      • ~590 Gbases of aligned data
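These figures can be cross-checked against each other. The sketch below assumes that coverage means aligned bases divided by genome size and uses the 3 Gbases genome size from the earlier slide; it divides totals, whereas the slide's 2.98X is presumably an average of per-sample values, so only rough agreement is expected.

```python
# Rough cross-check of the activity statistics.
SAMPLES = 67
RAW_GBASES = 800          # "~800 Gbases of raw data"
ALIGNED_GBASES = 590      # "~590 Gbases of aligned data"
GENOME_GBASES = 3         # human genome size from the earlier slide

aligned_fraction = ALIGNED_GBASES / RAW_GBASES
aligned_per_sample = ALIGNED_GBASES / SAMPLES
mean_coverage = aligned_per_sample / GENOME_GBASES

print(f"Aligned fraction:   {aligned_fraction:.0%}")
print(f"Aligned per sample: {aligned_per_sample:.1f} Gbases")
print(f"Estimated coverage: {mean_coverage:.2f}X")
```

The estimate comes out a little under 3X, in line with the average coverage reported on the slide.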
  49. Imputation
      • Program from Gonçalo Abecasis and Serena Sanna
      • Very powerful tool in the analysis of population genetics
      • Extrapolates from measured data to infer further genomic variations that you have not measured
      • Excellent e-Science: using the computer to do better science
      • This certainly merits a seminar to itself
  50. Plans and Visions
      • Illumina has announced its latest sequencers, which will measure 200 Gbases in a run of 8 days
      • 5 times our current performance in 20% less time
      • Easy to predict 400 or 600 Gbases: 10 to 15 times as much data per run
      • For the plans to sequence 2000 Sardinians together with NIH and with University at Ann Arbor, and also for other requests from the Park and from Sardinia, we would like to acquire some of these new machines
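The "5 times in 20% less time" claim can be checked against the GAIIx figures given earlier in the talk (40 Gbases per run at 4 Gbases per day). In this sketch the current run length of 10 days is derived from those two numbers rather than stated on any slide.

```python
# Compare the announced sequencer with the current GAIIx figures.
CURRENT_GBASES_PER_RUN = 40
CURRENT_GBASES_PER_DAY = 4
NEW_GBASES_PER_RUN = 200
NEW_RUN_DAYS = 8

# Derived, not stated on the slides: 40 Gbases at 4 Gbases/day.
current_run_days = CURRENT_GBASES_PER_RUN / CURRENT_GBASES_PER_DAY

data_ratio = NEW_GBASES_PER_RUN / CURRENT_GBASES_PER_RUN
time_saving = 1 - NEW_RUN_DAYS / current_run_days
new_gbases_per_day = NEW_GBASES_PER_RUN / NEW_RUN_DAYS
future_ratios = [g / CURRENT_GBASES_PER_RUN for g in (400, 600)]

print(f"Data per run:  {data_ratio:.0f}x")
print(f"Run time:      {time_saving:.0%} shorter")
print(f"Throughput:    {new_gbases_per_day:.0f} Gbases/day vs {CURRENT_GBASES_PER_DAY}")
print(f"400/600 Gbases: {future_ratios[0]:.0f}x to {future_ratios[1]:.0f}x")
```

The numbers are consistent with the slide: 5 times the data in a run that is 20% shorter, and 10 to 15 times the data at 400 or 600 Gbases per run.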
  58. My personal view
      • This is an opportunity for Sardinia to do frontier science on a world stage
      • It exploits the Sardinian genomic heritage and its increased “signal to noise” to find the origins and mechanisms of diseases that affect people around the world,
      • and which ultimately cost Sardinia (and the rest of humanity) a lot of money
      • It is driven by a predominantly Sardinian team doing excellent work
      • It necessarily binds together the strong computer centre of CRS4 and modern digital sequencing technology to build a forefront Sequencing Facility
      • If we don’t do this now we will lose a golden opportunity for ever
      • Where else would you set up such a Facility?
  59. Thank you for your attention!
