This document provides an overview of a presentation on population-scale high-throughput sequencing data analysis. It discusses:
1) The background and goals of the CSIRO/Omics Project which aims to investigate colorectal cancer susceptibility using sequencing data from 500 individuals.
2) Methods for processing large-scale NGS data on high-performance computing clusters and cloud infrastructure using the NGSANE framework, which allows processing modules to be run in parallel.
3) Preliminary research outcomes identifying cancer-associated genomic and microbiome changes from analysis of colorectal cancer and control samples.
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, to provide parallelisation for population-scale bioinformatics tasks. VariantSpark interfaces with the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than ADAM, the Spark-based genome clustering approach; than the comparable implementation using Hadoop/Mahout; and than Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits in speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
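For readers unfamiliar with the workflow, the following is a minimal, hypothetical PySpark sketch of the kind of population-structure clustering described above. It is not VariantSpark's own code: it assumes a toy genotype matrix (one row per individual, variants coded 0/1/2) instead of VariantSpark's VCF interface, and applies Spark MLlib's k-means directly.

```python
# Minimal sketch of population clustering with Spark MLlib k-means.
# Assumes a toy genotype matrix (individuals x variants, coded 0/1/2);
# VariantSpark itself reads VCF and scales to millions of variants.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("genotype-kmeans").getOrCreate()

# Hypothetical tiny dataset: (sample_id, variant_1, variant_2, variant_3)
rows = [
    ("s1", 0, 2, 1),
    ("s2", 0, 2, 2),
    ("s3", 2, 0, 0),
    ("s4", 2, 0, 1),
]
df = spark.createDataFrame(rows, ["sample_id", "v1", "v2", "v3"])

# Pack the variant columns into a single feature vector per individual.
assembler = VectorAssembler(inputCols=["v1", "v2", "v3"], outputCol="features")
features = assembler.transform(df)

# Cluster individuals; k corresponds to the expected number of populations.
model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).select("sample_id", "prediction").show()
```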
How novel compute technology transforms life science researchDenis C. Bauer
Unprecedented data volumes and pressure on turnaround time driven by commercial applications require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis, and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3,000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning applications on genomic data is hence capable of scaling up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
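The serverless pattern behind this can be illustrated with a hypothetical AWS Lambda handler (not GT-Scan2's actual implementation): each invocation scores one small batch of candidate target sites, so many batches can run in parallel and total runtime stays roughly constant as the number of candidates grows. The scoring function below is a placeholder.

```python
# Hypothetical AWS Lambda handler illustrating the serverless fan-out pattern
# described above (not GT-Scan2's actual code): each invocation scores one
# batch of candidate CRISPR target sites so batches can run in parallel.
import json


def score_site(sequence: str) -> float:
    """Toy on-target score: GC content of the candidate site (placeholder
    for a real efficiency/off-target model)."""
    gc = sum(base in "GC" for base in sequence)
    return gc / len(sequence) if sequence else 0.0


def handler(event, context):
    # 'event' is assumed to carry a payload like
    # {"sites": [{"id": "chr1:1000", "seq": "ACGTACGT..."}, ...]}
    sites = event.get("sites", [])
    results = [
        {"id": site["id"], "score": score_site(site["seq"])}
        for site in sites
    ]
    return {"statusCode": 200, "body": json.dumps(results)}
```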
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, with genomes of 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables; they either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for high-dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and its challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
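To make the "wide data" setting concrete, here is a hedged sketch using Spark ML's stock RandomForestClassifier on a tiny simulated sample-by-variant matrix. This is illustration only, not RandomForestHD itself; the abstract's point is precisely that stock implementations are tuned for tall, narrow data.

```python
# Sketch of the "wide data" setting: far more features (variants) than samples.
# Uses Spark ML's stock RandomForestClassifier for illustration only; the
# abstract argues that such implementations struggle at true WGAS scale,
# which is what motivated RandomForestHD.
import random
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("wide-rf-sketch").getOrCreate()

n_samples, n_variants = 20, 1000   # toy scale; WGAS is ~thousands x ~85M
random.seed(1)
rows = [
    (float(random.randint(0, 1)),                              # case/control label
     Vectors.dense([random.randint(0, 2) for _ in range(n_variants)]))
    for _ in range(n_samples)
]
df = spark.createDataFrame(rows, ["label", "features"])

rf = RandomForestClassifier(numTrees=50, maxDepth=5)
model = rf.fit(df)

# Feature importances give a rough ranking of variants associated with the label.
top = sorted(enumerate(model.featureImportances.toArray()),
             key=lambda kv: -kv[1])[:10]
print(top)
```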
VariantSpark a library for genomics by Lynn LangitData Con LA
VariantSpark is a library for scalable genomic analysis that can process large genomic datasets containing millions of variants and thousands of samples. It uses machine learning techniques like k-means clustering and random forests for unsupervised and supervised analysis. VariantSpark can analyze whole-genome datasets faster than other methods and scale to process 100% of the genomic data. It also integrates with cloud platforms like AWS and Databricks for easy access, with its capabilities demonstrated through Jupyter notebooks.
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
The document discusses KeyGene's use of Apache Spark for high-throughput genomics data analysis. KeyGene is a crop innovation company that analyzes genomic data from thousands of plants to improve crop traits like yield and quality. They previously used conventional HPC clusters for genomics pipelines but found Spark enabled more interactive analysis. KeyGene developed a "Sparkified" genomics pipeline using tools like BWA, GATK and their own Guacamole variant caller. This allowed interactive variant selection and GWAS using Spark SQL queries, demonstrating Spark is well-suited for interactive plant genomics analysis.
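A hypothetical sketch of such interactive variant selection with Spark SQL (the schema and values are assumptions, not KeyGene's actual pipeline): register a variants DataFrame as a temporary view and filter it with an ad-hoc query.

```python
# Hypothetical sketch of interactive variant selection with Spark SQL.
# Table layout and values are assumptions, not KeyGene's real pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-sql").getOrCreate()

variants = spark.createDataFrame(
    [
        ("chr1", 1200, "A", "G", "high", 0.41),
        ("chr1", 5600, "C", "T", "low", 0.03),
        ("chr2", 8900, "G", "T", "moderate", 0.22),
    ],
    ["chrom", "pos", "ref", "alt", "impact", "allele_freq"],
)
variants.createOrReplaceTempView("variants")

# Interactive selection: rare, high/moderate-impact variants on chromosome 1.
spark.sql("""
    SELECT chrom, pos, ref, alt, impact, allele_freq
    FROM variants
    WHERE chrom = 'chr1'
      AND allele_freq < 0.05
      AND impact IN ('high', 'moderate')
    ORDER BY pos
""").show()
```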
Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014).
Biomedical sciences rely on massive data sets. By using machines capable of generating large amounts of data with low cost, science has entered the 'Big Data' era, making computational infrastructures essential to maintain, transfer and analyze all this information.
Challenges and Opportunities of Big Data GenomicsYasin Memari
The document discusses the challenges and opportunities of big data genomics. It notes that the bottleneck in genomics has shifted from data generation to data handling as sequencing capacity doubles every year. While compression can help address the data deluge, throughput from techniques like metagenomics and single-cell sequencing will continue to outpace storage gains. The document then explores solutions for analyzing and storing large genomic datasets through techniques like cloud computing, distributed file systems, and MapReduce frameworks.
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWSAWS Chicago
This document summarizes a presentation about building a cloud-based platform as a service (PaaS) for forensic DNA analysis using Amazon Web Services (AWS). It provides background on human genetics and DNA forensics using short tandem repeats (STRs). It describes initial cloud prototyping efforts from 2014-2018, including developing open source tools like STRait Razor and STRGazer. It outlines the production-level cloud system implemented from 2015-2018 using AWS services like S3, EC2, RDS, and Lambda. Performance and security policies of the system are discussed. Examples of technology transfer and validation of next-generation sequencing kits are also provided.
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
Systems biology is becoming a data-intensive science due to the exponential growth of genomic and biological data. Large projects now produce petabytes of data that require new computational infrastructure to store, manage, and analyze. Cloud computing provides elastic resources that can scale to support the increasing data needs of systems biology. Case studies show how clouds are used for large-scale data integration and analysis, running combinatorial analysis over genomic marks, and enabling reanalysis of biological data through elastic virtual machines. The Open Cloud Consortium is working to provide open cloud resources for biological and biomedical research through testbeds and proposed bioclouds.
This document summarizes a study that benchmarked different metagenomic assembly approaches using a mock microbial community. The study found that while assembly generally improves functional annotation over analyzing unassembled reads, current assembly methods still have room for improvement, especially regarding misassemblies. The document also describes efforts to establish standardized assembly protocols and benchmarks in order to evaluate progress and better understand the challenges. Computational requirements for assembly remain high but are decreasing as methods improve.
The document discusses using Genome in a Bottle (GIAB) data on DNAnexus cloud platform. It describes two examples: 1) Comparing different mapper and variant caller combinations using GIAB pilot genome data. Benchmarking shows BWA and GATK Haplotype Caller performed best. 2) Assessing structural variation detection in the Ashkenazi Jewish Trio, combining data from Illumina and PacBio sequencing. DNAnexus is working with GIAB to develop benchmark datasets for structural variants.
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
This document discusses using Apache Spark to assemble metagenomes from short-read sequencing data. Metagenomes are genomes from microbial communities containing many species. Spark provides an efficient and scalable approach compared to previous methods. The document demonstrates clustering reads from small test datasets in Spark and evaluates performance on real datasets from 20 GB up to 100 GB, where jobs failed. While Spark is easy to develop for and efficient, challenges remain in robustness at large scales and in optimizing for different problem complexities.
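One building block of such Spark-based metagenome processing can be sketched as a classic map/reduce over reads, for example k-mer counting (an assumed illustration, not the presenters' code); read clustering and assembly build on statistics of this kind.

```python
# Minimal sketch (assumed illustration) of a Spark map/reduce over reads:
# count k-mers across a toy read set. Read clustering and assembly use
# k-mer statistics like these as their starting point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmer-count").getOrCreate()
sc = spark.sparkContext

reads = sc.parallelize([
    "ACGTACGTGG",
    "CGTACGTGGA",
    "TTGACGTACG",
])

K = 4
kmer_counts = (
    reads.flatMap(lambda r: [r[i:i + K] for i in range(len(r) - K + 1)])
         .map(lambda kmer: (kmer, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Most frequent k-mers across the (toy) read set.
print(kmer_counts.takeOrdered(5, key=lambda kv: -kv[1]))
```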
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident. For example, drugs with supporting genetic evidence are twice as likely to succeed in clinical trials. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
As a result, the Broad Institute began the open-source Hail project (https://hail.is), a scalable platform built on Apache Spark, to enable the worldwide genetics community to build, share and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, as well as annotations, on variants and samples; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes, enabling dozens of major research projects.
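A minimal sketch of a typical Hail workflow, based on its documented Python API (exact method names can differ between Hail versions, so treat this as an assumption rather than a verified snippet): import a VCF, apply basic variant QC, and run a common-variant association test.

```python
# Sketch of a typical Hail workflow (based on Hail's documented API; exact
# names may differ between versions): import variants, filter by call rate,
# and run a linear-regression association test against a sample phenotype.
import hail as hl

hl.init()

mt = hl.import_vcf("gs://my-bucket/cohort.vcf.bgz", reference_genome="GRCh38")

# Annotate per-variant QC metrics and keep well-called variants.
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)

# Assume a phenotype table keyed by sample ID with a numeric field 'pheno'.
pheno = hl.import_table("gs://my-bucket/phenotypes.tsv",
                        impute=True, key="sample_id")
mt = mt.annotate_cols(pheno=pheno[mt.s].pheno)

results = hl.linear_regression_rows(
    y=mt.pheno,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0],
)
results.order_by(results.p_value).show(10)
```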
VariantSpark - a Spark library for genomicsLynn Langit
VariantSpark is a custom Apache Spark library for genomic data. It implements a custom wide random forest machine learning algorithm designed for workloads with millions of features.
This document discusses genomic-scale data pipelines. It introduces Dr. Denis Bauer and his transformational bioinformatics team. It describes how genomic data and research will grow exponentially to exabytes by 2025. It outlines genomic research workflows and challenges like processing, analyzing, and visualizing large variant call format (VCF) data. It presents two cloud data pipeline patterns used by the team: 1) A Spark server cluster pipeline for machine learning on large genomic datasets. 2) A serverless pipeline using AWS Lambda and Step Functions for scalable genomic searches.
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
This document discusses the use of cloud computing technologies for genomic big data analysis. It begins by defining big data and describing the exponential growth of genomic data. It then discusses how cloud computing provides flexibility, scalability, and accessibility for genomic data processing through virtualization and large computing clusters. Specific technologies enabled for the cloud that help with genomic analysis are described, such as Hadoop, MapReduce, and genomic analysis tools adapted for these frameworks. The document concludes by discussing challenges remaining around data transfer speeds and the need for cloud application expertise, but also describes how platforms like Galaxy Cloudman and Cloudgene allow genomic analysis in the cloud without programming expertise.
Tin-Lap Lee (CUHK) presentation "GDSAP- A Galaxy-based platform for large-scale genomics analysis" from the Galaxy Community Conference 2012, Chicago, July 26th 2012
A topic I presented at the Cloud and DevOps stream at TechXLR8 London 2017. It covers using Google Cloud, Kubernetes, Docker and cloud functions to create a managed distributed compute infrastructure for generating synthetic genomic data for simulation and infrastructure-testing needs.
This document summarizes a presentation given by Luke Hickey of Pacific Biosciences on human genome sequencing using PacBio systems. It discusses PacBio sequencing technology developments, sequencing and assembly of the NA12878 genome, and the role of the NIST Genome in a Bottle (GIAB) reference materials. Specifically, it notes that PacBio sequenced the GIAB Ashkenazim trio genomes to high coverage and made the data publicly available. The sequencing and assembly of these genomes helps validate and improve PacBio sequencing technologies and supports the development and release of the trio as new NIST reference materials.
Opportunities for X-Ray science in future computing architecturesIan Foster
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
Scientists have successfully stored 700 terabytes of data in a single gram of DNA, vastly exceeding previous DNA data density records. DNA is an ideal storage medium as it is incredibly dense, with each DNA base representing a binary digit. Additionally, DNA is very stable and can preserve data for hundreds of thousands of years without needing to be kept in controlled environments like other storage methods. Researchers are also exploring using DNA to build biological computers and memory devices, taking advantage of DNA's ability to store and process genetic information.
1. The document describes a method called Anchored Assembly for detecting structural variants from short-read sequencing data using read overlap assembly and reference removal.
2. The method was validated against other SV detection tools using validated SVs from fosmid/PacBio sequencing, detecting 15 previously undetected SVs with high sensitivity and specificity.
3. Examples are given of validated deletions and insertions detected in an Ashkenazi Jewish trio that were identical in the offspring and followed expected inheritance patterns from parents.
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
This slidedeck discusses the most biologically efficient, cost-effective method for successful NGS. The GeneRead DNA QuantiMIZE Kits enable determination of the optimum conditions for targeted enrichment of DNA isolated from biological samples, while the GeneRead DNAseq Panels V2 allow you to quickly and reliably deep sequence your genes of interest. Applications in translational and clinical research are highlighted.
DNA has potential as a long-term data storage medium due to its stability, density, and redundancy. It can store 700 terabytes of data in 1 gram, which is equivalent to 3 million CDs and weighs 151 kilos if stored as hard drives. While DNA sequencing and synthesis speeds are currently slow, the cost per megabase of sequencing has dropped tremendously from $10,000 in 2001 to 10 cents in 2012. Researchers have successfully encoded digital files like images, documents and audio clips in DNA, demonstrating its viability for archiving large volumes of data in a small, stable format.
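As a worked example of the density claim, the following toy Python encoder maps binary data onto DNA bases at two bits per base and decodes it back. Real DNA-storage codes add error-correcting redundancy and avoid homopolymer runs; this is purely illustrative.

```python
# Illustrative (toy) mapping of binary data onto DNA bases at 2 bits per base,
# plus the reverse decode. Real DNA-storage codes add redundancy and avoid
# long runs of the same base.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {v: k for k, v in BASE_FOR_BITS.items()}


def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    bits = "".join(BITS_FOR_BASE[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA storage"
dna = encode(message)
print(dna)                    # 4 bases per byte
assert decode(dna) == message
```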
- PacBio HiFi reads are long (>10 kb) and accurate (>99%). HiFi reads are available now for HG002 and soon for HG001 and HG005.
- HiFi reads will be useful for comprehensive variant detection and phasing. Plans are outlined to apply HiFi reads to structural variant benchmarking and expand small variant calling to difficult regions.
The first steps of analysing sequencing data (2GS, NGS) have entered a transitional period: on one hand, most analysis steps can be automated and standardized (pipelines); on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour intensive.
I propose a centralized system within CSIRO that is flexible enough to cater for different analyses while also being generic enough to efficiently distribute labour-intensive maintenance and extension amongst the user community.
The presentation was given at CIBCB 2005 in San Diego about our approach to predicting recombination sites in protein sequences. Recombination is the method of choice for designing new proteins with desired new or enhanced properties.
The publication is:
Bauer, D.C., Bodén, M., Thier, R. and Gillam, E. M. “STAR: Predicting recombination sites from amino acid sequence.” BMC Bioinformatics, 2006 Oct 8; 7:437. PMID: 17026775
The primary goal of my trip to Seattle was to establish a collaboration with a world-leading group on data integration. But by having chosen Seattle, a hub for technology companies, I also learned about synergies between business and research: Ilya Shmulevich from the Institute for Systems Biology makes use of Amazon's "Random Forest" implementation and Google's 600,000-CPU cluster for cancer genomic association discovery. I also met with experts from the University of Washington and Microsoft Research to learn about technological advances for tackling Big Data and commoditizing parallelization. Finally, I observed a government-funded research agency invest in solutions geared towards their enterprise structure rather than adopt solutions designed for research institutes without an active computational community. In conclusion: CSIRO has unique properties and skill-sets that many collaborators would be interested in benefiting from; in return, such collaborations would propel CSIRO instantly to the forefront of technology, which could be very rewarding, particularly for the analysis of big, unstructured datasets.
Qbi Centre for Brain genomics (Informatics side)Denis C. Bauer
An overview of QBI’s production informatics framework with an emphasis on what services will be provided and how the resulting data are made available: from interactive quality control to integration with external data on the genome browser.
This session follows up from transcript quantification of RNAseq data and discusses statistical means of identifying differentially regulated transcripts and isoforms, contrasting these against microarray analysis approaches.
Allelic Imbalance for Pre-capture Whole Exome SequencingDenis C. Bauer
Exome sequencing has emerged as an economical way of focusing DNA sequencing efforts on the most functionally understood regions of the genome. Pre-capture pooling, where one bait library is used to pull down the exonic regions of several pooled samples simultaneously is a further financial improvement.
However, rare alleles in the pool might not attract baits at the same rate as reference-conforming sequences and may hence be underrepresented. We investigated this potential issue by sequencing a HapMap family (4 individuals) using the pre-capture protocols from Illumina and Nimblegen. We did not observe clear evidence that heterozygous variants are missed, but noted a trend for indels to be imbalanced.
Our findings do not provide clear evidence to rule out allelic imbalance or bias having an impact on research findings; this may be especially critical for low-cellularity cancer tissue, where rare alleles are more prevalent.
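A simple way to screen for the imbalance discussed above (an assumed illustration, not the study's actual analysis) is a binomial test on ref/alt read counts at heterozygous sites: without capture bias, the alt-read fraction should be close to 0.5.

```python
# Illustrative (assumed) screen for allelic imbalance at heterozygous sites:
# under no capture bias, ref and alt reads should be roughly 50/50, so a
# binomial test on the alt-read count flags strongly imbalanced sites.
from scipy.stats import binomtest

sites = [
    # (site, ref_reads, alt_reads) -- toy counts
    ("chr1:10177", 48, 52),
    ("chr2:20301", 71, 29),   # candidate imbalance
    ("chr3:55512", 12, 10),
]

for name, ref, alt in sites:
    n = ref + alt
    result = binomtest(alt, n, p=0.5)
    flag = "imbalanced?" if result.pvalue < 0.01 else "ok"
    print(f"{name}: alt fraction {alt / n:.2f}, p={result.pvalue:.3g} ({flag})")
```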
Cell differentiation and differential gene expressionStephanie Beck
Cells from the same individual can have different appearances despite having identical DNA because cells can selectively express different genes. Early in development, stem cells differentiate into specialized cell types by activating different sets of genes through gene regulation. Gene expression involves using DNA as a template to produce mRNA and then translate it into specific proteins. Different cell types produce different proteins by expressing only the genes required for their function.
This document provides an overview of protein structure, including levels of structure and classification. It discusses the importance of protein structure in determining function. The primary levels of structure are defined as primary (amino acid sequence), secondary (local folding patterns like alpha helices and beta sheets), tertiary (packing of secondary structures), and quaternary (assembly of protein chains). Protein structures can be classified based on their secondary structure composition as all-alpha, all-beta, alpha/beta, or alpha+beta. Domains are compact folding units associated with function.
How novel compute technology transforms life science researchDenis C. Bauer
AgileIndia 2018 Keynote. This talk covers how ‘Datafication’ will make data ‘wider’ (more features describing a data point), which represents a paradigm shift for Machine Learning applications. It also covers serverless architecture, which can cater for even compute-intensive tasks. It concludes by stating that business and life-science research are not that different: so let’s build a community together!
The ProteomeXchange consortium allows researchers to easily deposit and retrieve proteomics data. It includes repositories like PRIDE, PeptideAtlas, and recently MassIVE. The goal is to standardize submission and access across repositories through common identifiers and supported workflows. Over 1,300 datasets have been submitted, with many tools now supporting standard formats like mzIdentML for complete submissions. The most accessed datasets include large reference maps of the human proteome. Open source tools are improving submission and analysis of ProteomeXchange data.
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
This document discusses the development of an open-source application called OpenStudyBuilder that was built using the Neo4j graph database. OpenStudyBuilder has three main components: a clinical metadata repository, a web application interface, and an API layer. It applies domain-driven design principles to model complex clinical study data. Challenges discussed include performance issues with the Neo4j ORM library, how to present graph data in tables, changing data models over time (which requires data migrations), and potential limitations for non-profit or smaller users due to reliance on Neo4j Enterprise features. In summary, the document outlines how a Neo4j database was used as the data store for an enterprise clinical-study specification application to effectively model complex study data.
Blue Waters and Resource Management - Now and in the Futureinside-BigData.com
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Amazon Web Services
This session outlines how to deal with “big” (many samples) and “wide” (many features per sample) data on Apache Spark, how to keep runtime constant by using instantaneously scalable micro services (AWS Lambda), and how AWS technology has enabled inspirational real-world research use cases at CSIRO.
Speaker: Denis Bauer, Transformational Bioinformatics Team Leader, CSIRO
Level: 200
The document describes two projects, one to improve interprocess communication and one to improve slew models. The first tested different I/O protocols to optimize how a program passed data between processes, finding ZeroMQ fastest but ultimately not needed. The second fitted antenna slew models by correlating scheduled and logged slew times, updating the models and reducing errors.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2017 MVAPICH User Group, Adam Moody from Lawrence Livermore National Laboratory presents: MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts.
"High-performance computing is being applied to solve the world's most daunting problems, including researching climate change, studying fusion physics, and curing cancer. MPI is a key component in this work, and as such, the MVAPICH team plays a critical role in these efforts. In this talk, I will discuss recent science that MVAPICH has enabled and describe future research that is planned. I will detail how the MVAPICH team has responded to address past problems and list the requirements that future work will demand."
Watch the video: https://wp.me/p3RLHQ-hp6
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...Rothamsted Research, UK
Graph-based modelling is becoming more popular, in the sciences and elsewhere, as a flexible and powerful way to exploit data to power world-changing digital applications. Compared to the initial vision of the Semantic Web, knowledge graphs and graph databases are becoming a practical and computationally less formal way to manage graph data. On the other hand, linked data based on Semantic Web standards are a complementary, rather than alternative, approach to deal with these data, since they still provide a common way to represent and exchange information. In this paper we introduce rdf2neo, a tool to populate Neo4j databases starting from RDF data sets, based on a configurable mapping between the two. By employing agrigenomics-related real use cases, we show how such mapping can allow for a hybrid approach to the management of networked knowledge, based on taking advantage of the best of both RDF and property graphs.
IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...Lixi Conrads
Iguana is a framework for benchmarking the read-write performance of triple stores. It provides a realistic scenario by simulating multiple concurrent users querying and updating a triple store. Iguana executes benchmarks on different datasets and triple stores, measuring key performance indicators like queries per second. Results are stored in files and triple stores for analysis. The framework is extensible and can benchmark any dataset, SPARQL/update queries, and triple store configuration.
In this deck from the GPU Technology Conference, Thorsten Kurth from Lawrence Berkeley National Laboratory and Josh Romero from NVIDIA present: Exascale Deep Learning for Climate Analytics.
"We'll discuss how we scaled the training of a single deep learning model to 27,360 V100 GPUs (4,560 nodes) on the OLCF Summit HPC System using the high-productivity TensorFlow framework. We discuss how the neural network was tweaked to achieve good performance on the NVIDIA Volta GPUs with Tensor Cores and what further optimizations were necessary to provide excellent scalability, including data input pipeline and communication optimizations, as well as gradient boosting for SGD-type solvers. Scalable deep learning becomes more and more important as datasets and deep learning models grow and become more complicated. This talk is targeted at deep learning practitioners who are interested in learning what optimizations are necessary for training their models efficiently at massive scale."
Watch the video: https://wp.me/p3RLHQ-kgT
Learn more: https://ml4sci.lbl.gov/home
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
We are living in the world of “Big Data”. “Big Data” is mainly expressed with three Vs: Volume, Velocity and Variety. The presentation will discuss how Big Data impacts us and how SAS programmers can use SAS skills in a Big Data environment.
The presentation will introduce Big Data Storage solution – Hadoop and NoSQL. In Hadoop, the presentation will discuss two major Hadoop capabilities - Hadoop Distributed File System (HDFS) and Map/Reduce (parallel computing in Hadoop). The presentation will show how SAS can work with Hadoop using HDFS LIBNAME, FILENAME, SAS/ACCESS to Hadoop HIVE and SAS GRID Managers to Hadoop YARN. The presentation will also introduce the concepts of NoSQL database for a big data solution.
The presentation will also introduce how SAS can work with a variety of data formats, especially XML and JSON. It will show the use case of converting XML documents to SAS datasets using the LIBNAME XMLV2 XMLMAP statement. It will also introduce REST APIs for extracting data over the internet and demonstrate how SAS PROC HTTP can move data through a REST API.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Practical Chaos Engineering will show how to start running chaos experiments in your infrastructure and will guide you through the principles of chaos.
Exquiron migrated their cheminformatics platform from PipelinePilot to KNIME and ChemAxon technologies due to rising PipelinePilot costs. They worked with ChemAxon consultants to port complex workflows like hit expansion and dose-response reporting to KNIME. While the migration was successful, KNIME has higher memory requirements and slower loops than PipelinePilot. However, KNIME provides a modern reporting interface and faster fingerprint searches. With help from ChemAxon, Exquiron was able to optimize workflows and maintain project timelines during the platform migration.
This workshop is a hands-on session using Perftools from Cray with the NAS Benchmarks, step by step, on Shaheen II. I present the CrayPat, Apprentice2, and Reveal tools. There is a short introduction to Extrae/Paraver from the Barcelona Supercomputing Centre.
Similar to Population-scale high-throughput sequencing data analysis (20)
Cloud-native machine learning - Transforming bioinformatics research Denis C. Bauer
Cloud computing and artificial intelligence transform bioinformatics research
Denis Bauer, Transformational Bioinformatics Team
Genomic data is outpacing traditional Big Data disciplines, producing more information than astronomy, Twitter, and YouTube combined. As such, genomic research has leapfrogged to the forefront of Big Data and cloud solutions. We developed software platforms using the latest in cloud architecture, artificial intelligence and machine learning to support every aspect of genome medicine, from disease gene detection through to validation and personalized medicine.
This talk outlines how we find disease genes for complex genetic diseases, such as ALS, using VariantSpark, a custom machine learning implementation capable of dealing with whole-genome sequencing data of 80 million common and rare variants. To support disease gene validation, we created GT-Scan, an innovative web application that we think of as the “search engine for the genome”. It enables researchers to identify the optimal editing spot to create animal models efficiently. The talk concludes by demonstrating how cloud-based software distribution channels (digital marketplaces) can be harnessed to share bioinformatics tools internationally and make research more reproducible.
Translating genomics into clinical practice - 2018 AWS summit keynoteDenis C. Bauer
CSIRO's part of the co-presented Keynote at the AWS Public Sector Summit in Canberra on genomics health care. Three key messages: 1) We need a shift from treatment towards prevention 2) Once you go serverless you never go back 3) DevOps 2.0: Hypothesis-driven architecture evolution
Going Server-less for Web-Services that need to Crunch Large Volumes of DataDenis C. Bauer
AgileIndia breakout session on serverless applications. This talk covers how AWS serverless infrastructure can be used for a wide range of applications, such as compute-intensive tasks (GT-Scan), tasks requiring continuous learning (CryptoBreeder), and data-intensive tasks (PhenGen Database).
Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.
The document discusses challenges in identifying causal variants for complex diseases from sequencing data. It notes that while ideal situations may involve finding a variant common in all affected individuals and absent in unaffected, reality involves sifting through around 3.5 million SNPs. Methods like genome-wide association studies and focusing on exonic variants can help prioritize, but functional variants may also reside outside of protein coding regions. Considering combinations of variants through statistical genetics approaches may be needed to explain disease heritability. Quality control, annotation, and filtering are important but finding causal variants remains difficult.
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced.
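As a rough, hedged illustration of such a post-mapping pipeline (using samtools/bcftools as one possible toolchain rather than the GATK workflow covered in the session; all file names are placeholders):

#!/usr/bin/env bash
# Sketch only: post-mapping SNP/indel calling with samtools/bcftools.
# ref.fa and sample.bam are placeholder file names, not from the talk.
set -euo pipefail
samtools sort -o sample.sorted.bam sample.bam      # coordinate-sort the alignments
samtools index sample.sorted.bam                   # index for random access
bcftools mpileup -f ref.fa sample.sorted.bam \
  | bcftools call -mv -Ov -o sample.raw.vcf        # call SNPs and indels
bcftools filter -e 'QUAL<20' sample.raw.vcf -o sample.filtered.vcf   # simple quality filter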
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
This document discusses various topics related to mapping short sequencing reads to a reference genome, including:
- File formats like FASTQ that store sequencing reads and BAM/SAM formats for aligned reads.
- Alignment algorithms like hash table-based (MAQ, BWA) and suffix tree-based (BWA, Bowtie) mappers.
- Visualizing alignments using the Integrative Genomics Viewer (IGV).
- Performing quality control on BAM files by checking the percentage of mapped reads and coverage uniformity (see the sketch after this list).
- The next session will focus on identifying genomic variants from mapped reads through SNP/indel calling and filtering.
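A minimal sketch of the BAM quality-control step mentioned above (assuming samtools is available; Run1.bam is a placeholder file name):

#!/usr/bin/env bash
# Sketch only: fraction of mapped reads and a rough mean-coverage estimate for one BAM.
set -euo pipefail
samtools flagstat Run1.bam        # reports, among others, the percentage of mapped reads
samtools depth -a Run1.bam \
  | awk '{sum += $3; n++} END {if (n > 0) print "mean coverage:", sum / n}'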
Introduction to second generation sequencingDenis C. Bauer
An introduction to second generation sequencing will be given with focus on the basic production informatics: The approach of raw data conversion and quality control will be discussed.
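A minimal sketch of that raw-data quality-control step, assuming FastQC is installed (the paths are placeholders):

#!/usr/bin/env bash
# Sketch only: read-level QC reports for raw fastq files with FastQC.
set -euo pipefail
mkdir -p qc_reports
fastqc fastq/Exp1/Run1_read1.fastq fastq/Exp1/Run2_read1.fastq -o qc_reports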
Bioinformatics is an interdisciplinary field that merges biology, computer science, and information technology. It is applied in areas like genomics, proteomics, and systems biology. While some basic analysis can be done through user-friendly tools, truly customized work requires programming skills and an understanding of underlying algorithms. Bioinformatics is not just a service field but rather involves scientific experimentation throughout the entire analysis process from experimental design to evaluation. It is a dedicated field of research in its own right, not a quick or interchangeable task.
Critical Run files can be missing/corrupt after the Run folder was transferred from the HiSeq storage to the cluster storage. This presentation discusses the issue and suggests four workarounds.
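One generic safeguard, not necessarily among the four workarounds in the presentation, is to verify checksums after the transfer; a sketch with placeholder paths:

#!/usr/bin/env bash
# Sketch only: verify a transferred Run folder against a checksum manifest.
set -euo pipefail
SRC=/hiseq/runs/Run_0001          # Run folder on the instrument storage (placeholder)
DST=/cluster/runs/Run_0001        # copy on the cluster storage (placeholder)
MANIFEST="$PWD/run.md5"
(cd "$SRC" && find . -type f -exec md5sum {} + | sort) > "$MANIFEST"
(cd "$DST" && md5sum -c --quiet "$MANIFEST")   # prints any file that is missing or corrupt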
Deciphering the regulatory code in the genomeDenis C. Bauer
There are messages hidden within our genome, regulating when and how long a gene is switched on. The presentation describes a method, STREAM, targeted at deciphering this regulatory code.
This was our presentation for our imaginary product for the commercialization workshop. Note, all "research results" and illustrations are totally made up and therefore not necessarily reflective of reality (i.e. actual biological processes). This presentation was created as part of the learning experience of how to pitch biological research to venture capitalists.
Population-scale high-throughput sequencing data analysis
1. Denis C. Bauer | Bioinformatics | @allPowerde
08 July 2014
CSIRO COMPUTATIONAL INFORMATICS
Population-scale high-throughput sequencing data analysis
By Melody
2. Talk Overview
• Background: CSIRO/Omics Project
• Methods: NGS Data Processing on HPC/Cloud
• Research Outcome: Cancer and Microbes in Colorectal Cancer
3. CSIRO: Who we are
The Commonwealth Scientific and Industrial Research Organisation
• 62% of our people hold university degrees (2000 doctorates, 500 masters)
• With our university partners, we develop 650 postgraduate research students
• Top 1% of global research institutions in 14 of 22 research fields; top 0.1% in 4 research fields
• People: 6500 | Divisions: 13 | Locations: 58 | Flagships: 11 | Budget: $1B+
[Map of CSIRO sites across Australia, from Darwin and Perth to Hobart, including multiple sites in Sydney, Canberra, Melbourne and Brisbane]
4. Our business units
• 11 National Research Flagships
• 12 Research Divisions
• National Research Facilities and Collections
• Transformational Capability Platforms
Research groups: Food, Health & Life Science Industries; Environment; Manufacturing, Materials & Minerals; Energy; Information & Communications
5. Our track record: top inventions
1. Fast WLAN (Wireless Local Area Network)
2. Polymer banknotes
3. Relenza flu vaccine
4. Extended wear contacts
5. Aerogard
6. Total Wellbeing Diet
7. RAFT polymerisation
8. BarleyMax
9. Self-twisting yarn
10. Softly washing liquid
6. Part 1: The ‘omics project
The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome.
7. Data from Pilot Study
Full cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith), organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle).
8. Considerations before sequencing: Undersampling
• Objective: capture genomic variants reliably in tumour, normal and adipose tissue.
• Sequencing effort:
• 12 tumour -> 6 lanes (2-plex)
• 12 normal -> 3 lanes (4-plex)
• 12 adipose -> 3 lanes (4-plex)
More depth is needed for the tumour samples due to potentially low cellularity: at 2-plex (two samples per lane) the 12 tumour samples occupy 6 lanes, twice the 3 lanes used for the 4-plexed normal and adipose samples.
[Figure: tumour vs. normal sample, with additional depth allocated to the tumour]
9. Considerations before sequencing: Flowcell design
• Objective: process samples avoiding confounding factors.
Rather than sequencing the normal, adipose and tumour samples on one lane each, subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode): each tissue is 4-plexed and the samples are sequenced over 3 lanes.
[Figure: lane layout contrasting one lane per tissue with 4-plex multiplexing of samples L1, L2, O1, O2 across three lanes]
11. Blue Monster says
Design your experiment with project-specific pitfalls in mind.
Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
12. Part 2: NGS Data Processing
Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance compute clusters/cloud.
13. Resource consumption for Variant Calling
High-performance compute: a submission script hands the tasks to the scheduler, e.g.
qsub -t 1-36 task.qsub
#PBS -l nodes=2:ppn=8
36 samples (2.7 TB of data) on average require 128 hours CPU time (s.e. = 15) and 77 GB RAM (s.e. = 0.34).
[Figure: CPU time (hours), real time (hours) and memory (GB) per task (mapping, recalibration, transcripts, annotation, variant calling) for DNAseq and RNAseq]
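A minimal sketch of what such a task.qsub array job could look like; the resource request matches the slide, but the script body, walltime and samples.txt bookkeeping are illustrative assumptions rather than the project's actual script:

#!/bin/bash
# Sketch only: PBS array job, submitted as  qsub -t 1-36 task.qsub
#PBS -l nodes=2:ppn=8
#PBS -l walltime=12:00:00
#PBS -N variant_calling
cd "$PBS_O_WORKDIR"                              # run from the submission directory
SAMPLE=$(sed -n "${PBS_ARRAYID}p" samples.txt)   # one sample name per line, picked by array index
echo "Processing ${SAMPLE} (array task ${PBS_ARRAYID})"
# mapping, recalibration and variant calling for ${SAMPLE} would go here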
14. Tailored processing for different sequencing applications (doi:10.1038/nbt.2421)
Variant calling, methylation sites and gene expression each need their own wet-lab protocols and production informatics. Despite the different approaches we want to use the same processing framework!
15. Wish list for a framework
Reusability, reproducibility, robustness, adaptability, efficiency, cutting edge, data security, HPC environment, knowledge transfer (publication).
19. DEMO - files
Project X/
  fastq/
    Exp1/
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2/
      Run3_read1.fastq
We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2).
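A hedged sketch of setting up such a project layout before running NGSANE; the directory names mirror the demo, while the commands and source paths are illustrative and not part of NGSANE itself:

#!/usr/bin/env bash
# Sketch only: group raw fastq files by experiment/condition as in the demo layout.
set -euo pipefail
mkdir -p ProjectX/fastq/Exp1 ProjectX/fastq/Exp2
cp /path/to/Run1_read1.fastq /path/to/Run2_read1.fastq ProjectX/fastq/Exp1/
cp /path/to/Run3_read1.fastq ProjectX/fastq/Exp2/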
20. DEMO – setting up config file
#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )
#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes).
21. DEMO – dry run
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup enviroment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
We run NGSANE in dry-run mode to test what jobs it would submit.
22. DEMO – submit
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901
We submit the HPC jobs. Check out the returned qsub identifiers.
23. DEMO – scheduler
bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
burnet-srv.idpx.hpsc.csiro.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00
2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00
2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00
Three HPC jobs run in parallel because there were three fastq files. But there is no limit to the number of files processed in parallel: easy scale-up to populations.
24. DEMO – report
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R --R version 3.0.0 (2013-04-03) -- "Masked Marvel”
--Python--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014
More report examples
Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics).
25. DEMO - files
Project X/
  Summary HTML
  fastq/
    Exp1/
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2/
      Run3_read1.fastq
  Exp1/
    Bowtie/
      Run1.bam
      Run2.bam
  Exp2/
    Bowtie/
      Run3.bam
The resulting file structure: every experiment has a folder with the tasks as subfolders and, in them, the results (here: BAM files).
26. NGSANE currently supports
• Transfer data (smbclient)
• Quality control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D genome structure (Hicup, fit-hi-c, Hiclib, Homer)
A sketch of a config enabling several of these modules follows below.
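By way of illustration, further modules are switched on through config flags in the same way as the bowtie2 mapping flag shown in the demo; apart from RUNMAPPINGBOWTIE2, the flag names below are assumptions and should be checked against the NGSANE documentation:

#********************
# Tasks (sketch; only RUNMAPPINGBOWTIE2 is taken from the demo above,
# the other flag names are assumptions)
#********************
RUNFASTQC="1"            # read-level quality control (flag name assumed)
RUNMAPPINGBOWTIE2="1"    # mapping with bowtie2
RUNSAMVAR="1"            # variant calling with samtools (flag name assumed)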
27. For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
28. Blue Monster says
Analyze your data to be reproducible and well documented, with tools that scale well to larger datasets.
Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576
29. Part 3: Combining Omics Data
Seeing the full picture requires taking all information into account.
30. Result overview: traditional differential analysis
1. 722 genes differentially expressed (DE) between tumour and normal
• QC: good concordance with genes known to be up/down regulated in CRC
2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
• QC: good concordance with the previously reported gut methylation profile (Fernandez et al. Genome Res. 2012)
[Scatter plots of tumour FPKM vs. normal FPKM highlighting known DE genes (CSIRO in-house) and known DM locations]
33. DNA methylation: Blood signatures in adipose and gut samples
Tim Peters
Some gut/adipose samples have blood-like signatures.
35. Medical history: Blood potentially resulting from medication
• CARTIA: individuals 14, 50, 57
• WARFARIN: 40
• ASPIRIN: 59, 7
• COPLAVIX: 12
• No anti-clotting drug: 2, 62, 4
• No medication: 19, 20
Anti-thrombosis drugs are significantly enriched in individuals with human material in digesta (Wilcoxon rank sum test p-value = 0.02).
36. Microbial data: Blood-“liking” opportunistic bacteria are enriched in contaminated samples
E. coli, Salmonella and similar opportunistic pathogens respond to inflammation and bleeding.
A bacterial marker for low-level chronic gut bleeding?
38. Three things to remember
• Good experimental design is necessary (even) in sequencing experiments
• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon's elastic cloud)
• Promising research opportunities lie in the integration of multiple high-throughput data sources
39. COMPUTATIONAL INFORMATICS
Thank you
Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w www.csiro.au/bioinformatics
Buske et al., Bioinformatics, Jan 2014
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Fabian A. Buske, Susan Clark, Hugh French, Martin Smith (Garvan Institute of Medical Research, Sydney, Australia)
Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz (Computational Informatics, CSIRO, Australia)
Garry Hannan (Animal Food and Health Science, CSIRO, Australia)
Rodney Scott (University of Newcastle, Australia)
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO's IM&T; Science and Industry Endowment Fund
http://www.genome-engineering.com.au/
Editor's Notes
Staff # as at 30 June 2012 = 6492 (FTE = 5720)
2011-12 budget = $1.2 billion
--------------------
Some specifics about us:
CSIRO is Australia’s national science agency. We are a mission-directed, large-scale, multidisciplinary research and development organisation.
Since 1926, we have been in the business of applying scientific knowledge to the big issues facing Australia and increasingly the world. Globally we are recognised as one of the top 10 applied research organisations.
We bring together the best scientists in the world and teams of professionals to work together to help create industries, national wealth, a healthy environment and improved living standards.
We have delivered many innovations that have positively impacted on the daily lives of Australians and billions of others around the world.
In terms of our vital statistics we generate annual revenues of over A$1 billion . We have around 6,500 people in more than 50 locations across Australia. We lead 11 National Research Flagships addressing major challenges like water, climate, health, manufacturing, mineral resources. This includes our two new flagships: Biosecurity, Digital Productivity and Services (June 2012). We are a leading Australian patenting organisation with over 3,500 patents (granted and pending) and manage an IP portfolio of over 150 revenue bearing licenses.
While our research enjoys a high global ranking in terms of publication and citation rates it’s our focus on creating positive impact from science at scale that sets us apart from others in the Australian innovation system.
We do science with purpose. We do it well. We make a difference.
CSIRO operates in a matrix. This is to ensure we have the flexibility we need to be able to provide the right mix of skills and talent for major projects; pulling the right people from all across the organisation to form multidisciplinary teams. We understand that often the most successful science comes from crossing boundaries, and working in a matrix structure gives us the ability to do this and therefore help us to deliver impact for Australia.
We organise ourselves into 5 research groups. Within these we have 11 National Research Flagships (the two most recent being approved in June 2012 – Biosecurity, Digital Productivity and Services), plus core research portfolios, and 12 research Divisions as at 1 July 2012. We also have Transformational Capability Platforms and several national research facilities and collections.
There are equivalent organisations to CSIRO in a number of countries like India and South Africa albeit with slightly narrower missions. CSIRO’s research spans agriculture, food, manufacturing, materials, energy, minerals, health, ICT and the climate, water and environmental domains.
This is important to understand because increasingly the solutions to the major challenges we face are being found across sectors and at the interface of different scientific disciplines. So one of the hidden benefits of being large and multidisciplinary that we have discovered, is that you can more readily assemble the teams and partnerships necessary to deliver the scale of impact required to address the big questions facing humanity.
While we are a research organization, we were successful at commercializing a couple of our products. Most famously the wifi protocol which is now in every device using wireless technology like your laptop or phone. Closer to my area of research is Barlymax, a cereal which is high in fibre specifically developed to reduce the risk of bowel cancer.
We have a strong track record of commercial success. Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.