Population-scale high-throughput sequencing data analysis


Published on

Unprecedented computational capabilities and high-throughput data collection methods promise a new era of personalised, evidence-based healthcare, utilising individual genomic profiles to tailor health management as demonstrated by recent successes in rare genetic disorders or stratified cancer treatments. However, processing genomic information at a scale relevant for the health-system remains challenging due to high demands on data reproducibility and data provenance. Furthermore, the necessary computational requirements requires a large investment associated with compute hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks, like Amazon Web Services (AWS) provide an economical alternative to in-house compute clusters as they allow outsourcing of computation to third-party providers, while retaining the software and compute flexibility.
To cater for this resource-hungry, fast pace yet sensitive environment of personalized medicine, we developed NGSANE, a Linux-based, HPC-enabled framework that minimises overhead for set up and processing of new projects yet maintains full flexibility of custom scripting and data provenance when processing raw sequencing data either on a local cluster or Amazon’s Elastic Compute Cloud (EC2).

1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Staff # as at 30 June 2012 = 6492 (FTE = 5720)
    2011-12 budget = $1.2billion

    Some specifics about us:
    CSIRO is Australia’s national science agency. We are a mission-directed, large-scale, multidisciplinary research and development organisation.
    Since 1926, we have been in the business of applying scientific knowledge to the big issues facing Australia and increasingly the world. Globally we are recognised as one of the top 10 applied research organisations.
    We bring together the best scientists in the world and teams of professionals to work together to help create industries, national wealth, a healthy environment and improved living standards.
    We have delivered many innovations that have positively impacted on the daily lives of Australians and billions of others around the world.
    In terms of our vital statistics we generate annual revenues of over A$1 billion . We have around 6,500 people in more than 50 locations across Australia. We lead 11 National Research Flagships addressing major challenges like water, climate, health, manufacturing, mineral resources. This includes our two new flagships: Biosecurity, Digital Productivity and Services (June 2012). We are a leading Australian patenting organisation with over 3,500 patents (granted and pending) and manage an IP portfolio of over 150 revenue bearing licenses.
    While our research enjoys a high global ranking in terms of publication and citation rates it’s our focus on creating positive impact from science at scale that sets us apart from others in the Australian innovation system.
    We do science with purpose. We do it well. We make a difference.
  • CSIRO operates in a matrix. This is to ensure we have the flexibility we need to be able to provide the right mix of skills and talent for major projects; pulling the right people from all across the organisation to form multidisciplinary teams. We understand that often the most successful science comes from crossing boundaries, and working in a matrix structure gives us the ability to do this and therefore help us to deliver impact for Australia.
    We organise ourselves into 5 research groups. Within these we have 11 National Research Flagships (the two most recent being approved in June 2012 – Biosecurity, Digital Productivity and Services), plus core research portfolios, and 12 research Divisions as at 1 July 2012. We also have Transformational Capability Platforms and several national research facilities and collections.
    There are equivalent organisations to CSIRO in a number of countries like India and South Africa albeit with slightly narrower missions. CSIRO’s research spans agriculture, food, manufacturing, materials, energy, minerals, health, ICT and the climate, water and environmental domains.
    This is important to understand because increasingly the solutions to the major challenges we face are being found across sectors and at the interface of different scientific disciplines. So one of the hidden benefits of being large and multidisciplinary that we have discovered, is that you can more readily assemble the teams and partnerships necessary to deliver the scale of impact required to address the big questions facing humanity.
  • While we are a research organization, we were successful at commercializing a couple of our products. Most famously the wifi protocol which is now in every device using wireless technology like your laptop or phone. Closer to my area of research is Barlymax, a cereal which is high in fibre specifically developed to reduce the risk of bowel cancer.

    We have a strong track record of commercial success. Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.
  • 94.45+24.57+(7.02e+00)+2.25

  • There is more than genome sequencing

    Protocols evolve > pipelines need to be adjusted too
  • Process data in streams
    Command line efficiency
    With scale comes structure
  • Currently used for processing of biological data but can be adapted to work with other data sources
  • http://www.nscblog.com/miscellaneous/419/
  • Population-scale high-throughput sequencing data analysis

    1. 1. Denis C. Bauer | Bioinformatics | @allPowerde 08 July 2014 CSIRO COMPUTATIONAL INFORMATICS Population-scale high-throughput sequencing data analysis ByMelody
    2. 2. Talk Overview 2 | • Background: CSIRO/Omics Project • Methods: NGS Data Processing on HPC/Cloud • Research Outcome: Cancer and Microbes in Colorectal Cancer Denis Bauer | @allPowerde
    3. 3. 62% of our people hold university degrees 2000 doctorates 500 masters With our university partners, we develop 650 postgraduate research students Top 1% of global research institutions in 14 of 22 research fields Top 0.1% in 4 research fields Darwin Alice Springs Geraldton 2 sites Atherton Townsville 2 sites Rockhampton Toowoomba Gatton Myall Vale Narrabri Mopra Parkes Griffith Belmont Geelong Hobart Sandy Bay Wodonga Newcastle Armidale 2 sites Perth 3 sites Adelaide 2 sites Sydney 5 sites Canberra 7 sites Murchison Cairns Irymple Melbourne 5 sites CSIRO: Who we are Werribee 2 sites Brisbane 6 sites Bribie Island People Divisions Locations Flagships Budget 6500 13 58 11 $1B+ The Commonwealth Scientific and Industrial Research Organisation Denis Bauer | @allPowerde3 |
    4. 4. Our business units 12Research Divisions11National Research Flagships +National Research Facilities and Collections FOOD, HEALTH & LIFE SCIENCE INDUSTRIES ENVIRONMENT MANUFACTURING, MATERIALS & MINERALS ENERGY INFORMATION & COMMUNICATIONS +Transformational Capability Platforms Denis Bauer | @allPowerde4 |
    6. 6. Part 1: The ‘omics project The goalof the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome Denis Bauer | @allPowerde6 |
    7. 7. Data from Pilot Study Full Cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith) organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle) Denis Bauer | @allPowerde7 |
    8. 8. • Objective: capture genomic variances reliably in tumour normal and adipose. • Sequence effort: • 12 tumour -> 6 lanes (2-plex) • 12 normal -> 3 lanes (4-plex) • 12 adipose -> 3 lanes (4-plex) Considerations before sequencing: Undersampling More depth needed due to potentially low cellularity in the tumour sample additional depth tumour sample normal sample Denis Bauer | @allPowerde8 |
    9. 9. • Objective: process samples avoiding confounding factors Considerations before sequencing: Flowcell design L1 L2 L2 L2 O1 O1 O1 O2 O2 O2 Sequenced over 3 lanes L1 L1 Normal Adipose Tumour 4-plex 4-plex 4-plex L2 O2 L1 O1 L2 O2 L1 O1 Sequence on one lane each L2 O2 L1 O1 Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with a identifying barcode) Denis Bauer | @allPowerde9 |
    10. 10. • Population-scale sequencing with more samples than illumina-barcodes: imbalanced flowcell design will split samples and pair the halves with different partners (e.g. LeanSubj1.1 + Obese Subject 1.1; LeanSubj1.2 + Obese Subject 3.2 ) Considerations for Omics Proj.: Flowcell design L1.1 L1.1 O1.1 O1.1 O1.1 L1.1 Normal Adipose Tumour L2.1 L2.1 L2.1 O2.1 O2.1 O2.1 L3.1 L3.1 L3.1 O3.1 O3.1 O3.1 L4.1 L4.1 L4.1 O4.1 O4.1 O4.1 Lane1 Lane2 Lane3 Lane4 L1.2 L1.2 O3.2 O3.2 O3.2 L2.2 L2.2 L2.2 O4.2 O4.2 O4.2 L3.2 L3.2 L3.2 O1.2 O1.2 O1.2 L4.2 L4.2 L4.2 O2.2 O2.2 O2.2 Lane5 Lane6 Lane7 Lane8 L1.2 4-plex 4-plex 2-plex L=Lean O=Obese L1.1=Lean individual 1 part 1 (of 2) ... 12 Lanes Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781 Denis Bauer | @allPowerde10 |
    11. 11. Blue Monster says Design your experiment with project- specific pitfalls in mind Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781 Denis Bauer | @allPowerde11 |
    12. 12. Part 2: NGS Data Processing Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance- compute clusters/cloud Denis Bauer | @allPowerde12 |
    13. 13. Resource consumption for Variant Calling qsub –t 1-36 task.qsub Script Submission Scheduler 0 50 100 100 DNAseq average task mapping recalibration transcripts annotation variant Resource consumption 36 samples (2.7T data) on average requires 128 hours CPU time (ste= 15) 77 GB RAM (ste=0.34) CPU (hours) Real time (hours) Memory (GB) 0 50 100 0 50 100 DNAseqRNAseq cpu cpu_real memory type average task mapping recalibration transcripts annotation variant Resource consumption #PBS –l nodes=2:ppn=8 High-Performance-Compute Denis Bauer | @allPowerde13 |
    14. 14. doi:10.1038/nbt.2421 Tailored processing for different sequencing applications Wet-lab Protocols Production Informatics Variant Calling Methylation Sites Gene Expression Despite different approaches we want to use the same processing framework! Denis Bauer | @allPowerde14 |
    15. 15. reusability cutting edgedata security HPC environment reproducibility robustness adaptability knowledge transfer (publication) efficient Wish list for a framework Denis Bauer | @allPowerde15 |
    16. 16. Denis Bauer | @allPowerde16 |
    17. 17. Denis Bauer | @allPowerde17 |
    18. 18. Assess experimental success quickly Denis Bauer | @allPowerde18 |
    19. 19. DEMO - files Project X fastq Exp1 Run1_read1.fastq Run2_read1.fastq Exp2 Run3_read1.fastq We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2) Denis Bauer | @allPowerde19 |
    20. 20. DEMO – setting up config file #******************** # Data #******************** declare -a DIR; DIR=( Exp1 Exp2 ) #******************** # Tasks #******************** RUNMAPPINGBOWTIE2="1" # mapping with bowtie2 #******************** # Paths #******************** # reference genome FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa 20 | Denis Bauer, @allPowerde We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project specific settings (here: use igenomes)
    21. 21. DEMO – dry run bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt [NGSANE] Trigger mode: [empty] (dry run) [NOTE] Folders: Exp1 Exp2 [Task] bowtie2 [NOTE] setup enviroment [TODO] Exp1/Run1_read1.fastq [TODO] Exp1/Run2_read1.fastq [TODO] Exp2/Run3_read1.fastq [NOTE] proceeding with job scheduling... [NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1 [NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1 [NOTE] make Exp1/bowtie2/Run3.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run3_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1 We run NGSANE in dry run to test what jobs it would submit Denis Bauer | @allPowerde21 |
    22. 22. DEMO – submit bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed [NGSANE] Trigger mode: armed Double check! Then type safetyoff and hit enter to launch the job: safetyoff ... take cover! [NOTE] Folders: Exp1 Exp2 [Task] bowtie2 [NOTE] setup environment [TODO] Exp1/Run1_read1.fastq [TODO] Exp1/Run2_read1.fastq [TODO] Exp2/Run3_read1.fastq [NOTE] proceeding with job scheduling... [NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1 Jobnumber 2424899 [NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1 Jobnumber 2424900 [NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy [ JOB] /apps/gi/ngsane/ -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2 Jobnumber 2424901 We submit HPC jobs. Checkout the returned qsub identifiers. Denis Bauer | @allPowerde22 |
    23. 23. DEMO – scheduler bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c burnet-srv.idpx.hpsc.csiro.au: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00 2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00 2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00 Three HPC jobs run in parallele because there were three fastq files. But there is no limit to the number of files to process in parallele: easy scale- up to populations. Denis Bauer | @allPowerde23 |
    24. 24. DEMO – report bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html [NGSANE] Trigger mode: html >>>>> Generate HTML report >>>>> startdate Fri Jan 24 08:02:37 EST 2014 >>>>> hostname burnet-login >>>>> makeSummary.sh -k /NGSANEDEMO/config.txt --R --R version 3.0.0 (2013-04-03) -- "Masked Marvel” --Python--Python 2.7.2 QC - bowtie2 >>>>> Generate HTML report - FINISHED >>>>> enddate Fri Jan 24 08:02:39 EST 2014 More report examples Now create the HTML overview page, to check if jobs finised sucessfully and what the results are (bowtie2: mapping statistics) Denis Bauer | @allPowerde24 |
    25. 25. DEMO - files Project X Summary HTML Exp1 Bowtie Run1.bam Run2.bam Exp2 Bowtie Run3.bam fastq Exp1 Run1_read1.fastq Run2_read1.fastq Exp2 Run3_read1.fastq The resulting file structure: every experiment has a folder with the tasks as subfolders and in them the results (here: bam files) Denis Bauer | @allPowerde25 |
    26. 26. NGSANE Currently supports • Transfer data (smbclient) • Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code) • Trimming (Cutadapt,Trimgalore, Trimmomatic) • Mapping (BWA,Bowtie1,Bowtie2,Tophat) • Transcript Quantification (cufflinks, htseq, bedtools) • Variant calling (GATK, samtools) • Variant annotation (annovar) • 3D Genome structure (Hicup, fit-hi-c, Hiclib, Homer) Denis Bauer | @allPowerde26 |
    27. 27. For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine Denis Bauer | @allPowerde27 |
    28. 28. Blue Monster says Analyze your data to be reproducible and well documented with tools that scale well to larger datasets Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576 Denis Bauer | @allPowerde28 |
    29. 29. Part 3: Combining Omics Data Seeing the full picture requires taking all information into account Denis Bauer | @allPowerde29 |
    30. 30. Result overview: traditional differential analysis 1e−02 1e+00 1e+02 1e−02 1e+00 1e+02 tumour FPKM + 0 normalFPKM+0 1. 722 genes differentially expressed (DE) between tumour and normal • QC: We have good concordance with genes known to be up/down regulated in CRC 2. 841 differentially methylated (DM) genomic regions -- mostly hypermethylated • QC: good concordance with previously reported gut methylation profile 0.1 10.0 0.1 10.0 tumour FPKM + 0 normalFPKM+0 Fernandez et al. Genome Res. 2012CSIRO inhouse Known DE gene Known DM locations Denis Bauer | @allPowerde30 |
    31. 31. Microbial Population:traditional population survey Paul Greenfield Denis Bauer | @allPowerde31 |
    32. 32. Data integration (image credit: Francis Tabary) Denis Bauer | @allPowerde32 |
    33. 33. DNA methylation: Blood signatures in Adipose and Gut samples Tim Peters Some gut/adipose samples have blood- like signatures. Denis Bauer | @allPowerde33 |
    34. 34. Exonseq: blood-signatures stem from a blood-plasma protein ●● ● ● ●● cor = 0.78 ●● ● ● ●● cor = 0.73 ● ● ● ● ● ● cor = 0 ●● ● ● ●● cor = 0.65 0.0e+00 5.0e−06 1.0e−05 1.5e−05 0e+00 2e−05 4e−05 6e−05 8e−05 0.0000 0.0005 0.0010 0.0015 0.0020 0e+00 1e−05 2e−05 ADM2COL6A3FNIP1HAAO cor = −0.2 ●● ● ● ● ● cor = 0.57 ● ● ● ● ●● cor = 0.16 −1e−04 0e+00 0e+00 1e−04 2e−04 0e+00 1e−04 2e−04 HGB2MALT1 0.00 0.25 0.50 0.75 total/reads ● ● ● ● ● ● cor = 0 ●● ● ● ●● cor = 0.65 ● ● ● ● ● ● cor = −0.59 ● ● ● ● ● ● cor = −0.58 ● ● ● ● ● ● cor = −0.51 ● ● ● ● ●● cor = −0.2 ●● ● ● ● ● cor = 0.57 ● ● ● ● ●● cor = 0.16 0.0000 0.0005 0.0010 0.0015 0.0020 0e+00 1e−05 2e−05 −0.0005 0.0000 0.0005 0.0010 0.0015 −0.001 0.000 0.001 −0.001 0.000 0.001 0.002 0.003 −1e−04 0e+00 1e−04 2e−04 0e+00 1e−04 2e−04 0e+00 1e−04 2e−04 FNIP1HAAOHBA1HBA2HBBHGB1HGB2MALT1 0.00 0.25 0.50 0.75 count/total factor(samples) ● ● ● ● ● ● ● ● ● ● ● ● 2 4 7 12 14 19 20 40 50 57 59 62 factor(status) ● lean obese Contamination by ADM2, a gene expressed in blood plasma Individuals Contamination (%) Contamination(%) expression Plasma protein ADM2 makes up most of the human material in the digesta (number of reads mapping to human genome) Denis Bauer | @allPowerde34 |
    35. 35. Medical History: Blood potentially resulting from medication CARTIA 14,50,57 WARFARIN 40 ASPIRIN 59,7 COPLAVIX 12 No anti-clotting drug 2, 62, 4 No medication 19,20 Wilcoxon rank sum test p-value = 0.02 Anti-thrombosis drugs significantly enriched in individuals with human material in digesta. Denis Bauer | @allPowerde35 |
    36. 36. Microbial data: Blood “liking” opportunistic bacteria are enriched in contaminated samples E. coli and Salmonella etc Opportunistic pathogens. Respond to inflammation and bleeding Bacterial marker for low level chronic gut bleeding ? Denis Bauer | @allPowerde36 |
    37. 37. Blue Monster says Integrating different ‘omics data is still a challenge. Denis Bauer | @allPowerde37 |
    38. 38. Three things to remember • Good experimental design is necessary (even) in sequencing experiments • Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high- performance systems and Amazon’s elastic cloud) • Promising research opportunities are in the integration of multiple high- throughput data sources Denis Bauer | @allPowerde38 |
    39. 39. COMPUTATIONAL INFORMATICS Thank youComputational Informatics Denis C. Bauer t +61 2 9123 4567 e Denis.Bauer@csiro.au w www.csiro.au/bioinformatics Buske et al., Bioinformatics, Jan 2014 More talks online: Twitter: http://www.slideshare.net/allPowerde @allPowerde Fabian A. Buske Susan Clark Hugh French Martin Smith Garvan Institute of Medical Research, Sydney, Australia Robert Dunne Tim Peters Paul Greenfield Piotr Szul Tomasz Bednarz Computational Informatics, CSIRO, Australia Garry Hannan Animal Food and Health Scinece, CSIRO, Australia Rodney Scott University of Newcastle, Australia Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund http://www.genome-engineering.com.au/