1. Capstone Project for the Specialization in Systems Biology - August 2015
Fabio Amaral – fabioamaral@me.com
1. Project Goal
This project is aimed at setting up an experimental and analytical workflow for classifying individuals based on the predicted robustness of their innate immune response against influenza infection. A total of 10 physician volunteers will be evaluated and the top five ranked candidates will be selected to attend a humanitarian mission in a region affected by an influenza outbreak. The volunteers will be required to donate a blood sample, which will be used to assess the level of viral response via an integrated genomic and gene expression signature analysis based on the seminal work by Lee et al. (1), performed as part of the Phenogenetic Project and the ImmVar Consortium (2).
2. Background
Genome-wide association studies (GWAS) have been effective in identifying common genetic variants which confer susceptibility to complex diseases and other phenotypes of interest (3). However, GWAS usually fail to pinpoint the causative mechanisms that lead to such traits. This is mainly due to variants often having small effect sizes, linkage disequilibrium between associated variants, low heritability and varied influences from the environment (4). Interestingly, the vast majority of disease-associated single nucleotide polymorphisms (SNPs) map to the non-coding vicinity of genes, often affecting their level of expression (5). Variants that exert their regulatory effect on the steady-state levels of expression have been called expression Quantitative Trait Loci (eQTL), or response Quantitative Trait Loci (reQTL) when the effect on expression is dependent on a stimulus (6), and these QTLs may manifest themselves in a tissue- or cell-type-specific manner (7).
Context-specific reQTLs are of extreme relevance in immunity to infection, since quantitative changes in gene expression alter the outcome of immune responses to a wide variety of perturbations, such as infection with various pathogens and vaccination (8). Moreover, some viral infections (e.g. rhinovirus) can lead to a wide spectrum of symptom severity which is, at least in part, influenced by inter-individual reQTL variation among hosts (9). Furthermore, different pathogen stimuli can affect either shared or stimulus-specific reQTL-associated genes, as shown in a recent study which assessed the effect of stimulating dendritic cells (DCs) with bacterial lipopolysaccharide (LPS), influenza virus and interferon-β (1). Of the 121 common reQTLs (minor allele frequency >5%) identified in this study, only 7 loci had an effect on nearby genes specifically in response to influenza infection, which makes these reQTLs interesting parameters for assessing the effectiveness of the response to such viral infections.
Importantly, the profiling of blood transcriptomics integrated with genome-scale variant genotyping provides an attractive means for evaluating the immune status of individuals (10). Therefore, we propose to test for the presence of the 5 most significant influenza-specific reQTLs (Table 1) and the up-regulation levels of the associated genes as a proxy for a robust immune response against influenza, in order to select the five physician candidates who would be best suited to attend a population affected by a flu outbreak.
3. Experimental approach
In order to obtain comparable results, we will use the experimental approach set by Lee et al. (1), with some modifications to account for current technological improvements, as described below.
3.1. Whole genome re-sequencing/variant calling
The peripheral blood mononuclear cells (PBMCs) isolated from the volunteers' blood will be used for variant calling by next-generation whole genome re-sequencing to obtain improved variant resolution. DCs will be derived from PBMCs and infected with influenza virus as described in the original Science report (1).
3.2. DC enrichment/single cell RNA-Seq
Since the observed reQTL effect sizes could vary considerably when evaluating DCs in bulk, due to heterogeneity in the composition of the cell population, we opted to perform a magnetic cell enrichment of the DC population with the Blood Dendritic Cell Isolation Kit II, human (Miltenyi Biotec), which results in an enriched cell fraction comprising plasmacytoid dendritic cells, CD1c (BDCA-1)+ type-1 myeloid dendritic cells (MDC1s), and CD1c (BDCA-1)- CD141 (BDCA-3) bright type-2 myeloid dendritic cells (MDC2s). This enriched DC fraction will have its cell content profiled by flow cytometry, followed by single cell RNA-Seq (2 x 96 cells per individual: resting vs. infected cell samples) using a Fluidigm platform, which will also help to improve the resolution of the reQTL analysis (11).
4. Computational Analyses
The computational analyses for the transcriptomic and genomic datasets are described below, and their respective flow charts can be found in the annex at the end of this document.
4.1. Fastq files
Both the transcriptome and genome assembly pipelines use fastq format files as input. A fastq file is a text file which contains the raw information for each read coming out of a next-generation sequencing experiment. Each read is represented as four text lines. The first line starts with an @ sign followed by the sequence identifier and an optional description. The second line contains the actual nucleotide sequence of the read. The third line starts with a + sign and separates the sequence from the nucleotide quality scores in the fourth line. Each nucleotide is associated with a Phred quality score that estimates its reliability. These quality scores are encoded as ASCII characters, and aligners usually assume the Sanger encoding (Phred+33) by default, as this has been the standard format since Illumina 1.8.
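The four-line record layout and the Phred+33 encoding can be illustrated with a minimal Python sketch. It is not part of either pipeline (the aligners read fastq files directly); the names FastqRecord, parse_fastq and decode_phred33, as well as the example read, are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # text after the leading '@'
    sequence: str          # nucleotide sequence of the read
    qualities: List[int]   # Phred scores decoded from the fourth line


def decode_phred33(quality_line: str) -> List[int]:
    """Convert an ASCII quality string to Phred scores (Sanger, Phred+33)."""
    return [ord(char) - 33 for char in quality_line]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record for every group of four non-blank lines."""
    content = [line.rstrip("\n") for line in lines if line.strip()]
    if len(content) % 4 != 0:
        raise ValueError("fastq input must contain four lines per read")
    for start in range(0, len(content), 4):
        header, sequence, separator, quality = content[start:start + 4]
        record_number = start // 4 + 1
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record {record_number}")
        if len(sequence) != len(quality):
            raise ValueError(f"record {record_number}: length mismatch")
        yield FastqRecord(header[1:], sequence, decode_phred33(quality))


# Illustrative read (made up): 'I' encodes Phred 40, '5' encodes 20, '!' encodes 0.
example_lines = ["@read1 demo", "GATTACA", "+", "II5I!#I"]
records = list(parse_fastq(example_lines))
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so the 'I' characters above (Q = 40) mark bases with an estimated error rate of 1 in 10,000.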
4.2. Differential Expression RNA-Seq (Tuxedo) Pipeline (see annexed Figure 1-A)
1) TopHat - RNA-Seq alignment: High quality single-end RNA-Seq reads (fastq files as input) for all the single cell samples will be mapped to a human reference genome (hg19 build) using the TopHat aligner (13). TopHat is an alignment program specifically designed for the analysis of RNA-Seq data and is therefore able to map reads to the genome even when the reads span splice junctions, whose flanking genomic regions can be separated by relatively large introns. TopHat produces the following output files for each sample aligned: a) align_summary.txt, b) insertions.bed, c) deletions.bed, d) splice_junctions.bed and e) accepted_hits.bam.
2) Cufflinks - Transcript assembly: The mapped reads for the expressed genes and transcripts will be assembled for each sample (accepted_hits.bam files) using Cufflinks and a reference gene annotation (hg19.gtf) to estimate isoform expression. Cufflinks is both the name of a suite of tools and a program within that suite. The program Cufflinks assembles transcriptomes from RNA-Seq data and quantifies their expression, producing the following output files for each sample: a) gene_expression.tabular, b) transcript_expression.tabular, c) assembled_transcripts.gtf and d) skipped_transcripts.gtf.
3-4) Cuffmerge - Merge assemblies: A file called assemblies.txt that lists the assembly file for each sample (assembled_transcripts.gtf) will be created. This file is used to run Cuffmerge on all the assemblies to create a single merged transcriptome annotation, using the references for gene annotation (hg19.gtf) and genomic sequence (hg19_genome.fasta). The output of Cuffmerge is a single GTF file that contains an assembly merging together all the input assemblies.
5) Cuffdiff - Differential expression inference: We will run Cuffdiff using as input the merged transcriptome assembly GTF file along with the BAM files (accepted_hits.bam) from TopHat for each sample. Cuffdiff is used for finding significant changes in transcript expression, splicing and promoter use, and produces 11 output files:
a. Transcript FPKM (+count) expression tracking
b. Gene FPKM (+count) expression tracking
c. Primary transcript FPKM (+count) tracking
d. Coding sequence FPKM (+count) tracking
e. Transcript differential FPKM
f. Gene differential FPKM
g. Primary transcript differential FPKM
h. Coding sequence differential FPKM
i. Differential splicing tests
j. Differential promoter tests
k. Differential CDS tests
6-18) Differential expression analysis: The differential expression analysis results will be explored with CummeRbund in the R environment. CummeRbund uses all the output files from Cuffdiff to create a SQLite database of results describing the relationships between genes, transcripts, transcription start sites and CDS regions. The stored and indexed data can be used for exploring sub-features of individual genes or gene sets and for plotting visualisations of the data.
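For orientation, the FPKM values tracked by Cuffdiff in step 5 follow, in their simplest form, the definition "fragments per kilobase of transcript per million mapped fragments". The Python sketch below is a simplification under stated assumptions: Cufflinks and Cuffdiff estimate abundances with a statistical model rather than this direct count ratio, and the function names and the pseudocount of 1 are illustrative choices that are not taken from the Tuxedo tools.

```python
import math


def fpkm(fragments: float, transcript_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(fpkm_infected: float, fpkm_resting: float,
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of infected over resting expression.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes that are not detected in one of the two conditions.
    """
    return math.log2((fpkm_infected + pseudocount) / (fpkm_resting + pseudocount))


# 500 fragments on a 2,000 bp transcript in a library of 10 million fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
# Expression rising from FPKM 1 to FPKM 7 after infection.
example_change = log2_fold_change(7.0, 1.0)
```

A positive log2 fold change indicates up-regulation after influenza infection relative to the resting state, which is the direction of change this proposal relies on.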
4.3. Genome Re-Sequencing Assembly and Variant Calling (see annexed Figure 1-B)
1) BWA-MEM - Map genomic reads for each subject: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-MEM is its latest algorithm, designed for fast and accurate mapping of high quality Illumina sequence reads ranging from 70bp to 1Mbp (14). High quality paired-end reads in fastq format are used as input together with a reference genome fasta file, such as the human genome hg19 build. The alignments are stored as a compressed binary BAM file (BWA-MEM itself emits SAM, which is converted to BAM).
2) Picard, AddOrReplaceReadGroups tool - Label read groups: We use this Picard tool to label the reads from each sample in the BAM files before merging them for further analysis.
3) Picard, MergeSamFiles - Merge the read group labelled BAM files: MergeSamFiles is also a component of Picard and is used for merging multiple SAM/BAM files into one file.
4) Samtools - Filter reads: Samtools is used to retain only high quality mapped and properly paired reads.
5) Picard, Paired Read Mate Fixer - Sort reads by coordinates: This Picard function can be used to adjust the ordering of reads; here it sorts them by coordinate.
6) Picard, MarkDuplicates - Remove all duplicated reads: We use the MarkDuplicates function to remove duplicate reads, which are amplification artefacts from library preparation.
7) FreeBayes - Call variants: FreeBayes (15) is a Bayesian genetic variant detector designed to find small polymorphisms (SNPs, indels, MNPs and complex events) smaller than the length of a short-read sequencing alignment. The processed short-read alignment BAM files, with Phred+33 encoded quality scores, and a reference genome in fasta format are used by FreeBayes to determine the most likely haplotypes for the individuals at each position in the reference. The output of this variant caller is a variant call format (VCF) file that reports the positions it finds putatively polymorphic.
8) VCFlib, VCFfilter - Filter for high confidence and high coverage variant calls: We use VCFfilter to select only high confidence variant calls based on the Phred-scaled quality score “QUAL 40” (an estimated probability of 1 in 10,000 that the call is wrong) and sufficient read coverage depth “DP 5”.
9) ANNOVAR - Annotate variants: Finally, ANNOVAR (16) is used to functionally annotate the detected genetic variants, identify variants that are documented in specific databases such as dbSNP, and report their allele frequencies based on the 1000 Genomes Project, the NHLBI-ESP 6500 exomes or the Exome Aggregation Consortium.
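The two filtering steps of this workflow (step 4 on alignments, step 8 on variant calls) can be sketched in plain Python to make the criteria explicit. This is an illustration under stated assumptions, not a replacement for Samtools or VCFfilter: the function names are hypothetical, the MAPQ cut-off of 20 is assumed because the text does not give one, and the QUAL and DP thresholds are applied as strict "greater than" comparisons, which is also an assumption.

```python
from typing import Optional

# FLAG bits as defined in the SAM format specification.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4


def keep_alignment(sam_line: str, min_mapq: int = 20) -> bool:
    """Step 4: keep mapped, properly paired reads with sufficient MAPQ.

    min_mapq=20 is an assumed cut-off for "high quality".
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    mapped = not (flag & FLAG_UNMAPPED)
    properly_paired = bool(flag & FLAG_PROPER_PAIR)
    return mapped and properly_paired and mapq >= min_mapq


def phred_to_error_probability(qual: float) -> float:
    """Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-qual / 10)


def info_value(info: str, key: str) -> Optional[str]:
    """Return the value of one key in a VCF INFO column, or None if absent."""
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def keep_variant(vcf_line: str, min_qual: float = 40.0, min_depth: int = 5) -> bool:
    """Step 8: keep calls with QUAL above min_qual and DP above min_depth."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual, info = fields[5], fields[7]
    depth = info_value(info, "DP")
    if qual == "." or depth is None:
        return False
    return float(qual) > min_qual and int(depth) > min_depth


# Made-up records for illustration only.
sam_good = "r1\t99\tchr1\t100\t60\t7M\t=\t200\t107\tGATTACA\tIIIIIII"
sam_unmapped = "r2\t77\t*\t0\t0\t*\t*\t0\t0\tGATTACA\tIIIIIII"
sam_low_mapq = "r3\t99\tchr1\t100\t5\t7M\t=\t200\t107\tGATTACA\tIIIIIII"
vcf_good = "chr1\t1000\t.\tA\tG\t55.2\t.\tDP=12;AF=0.5"
vcf_low_qual = "chr1\t2000\t.\tC\tT\t12.0\t.\tDP=30"
vcf_shallow = "chr1\t3000\t.\tG\tA\t90.0\t.\tDP=3"
```

With these definitions only sam_good and vcf_good pass, and phred_to_error_probability(40) gives the 1 in 10,000 error estimate quoted for the QUAL threshold.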
5. Expected Data / Interpretation of Data
We will search for the top 5 most significant FLU-specific reQTLs characterised by Lee et al. (1) among the variants identified in our 10 genotyped candidates. The presence of such variants will be correlated with a significant up-regulation of their associated genes in the single cell samples analysed after influenza infection in comparison to the unperturbed state (Table 1). For this correlation we will take into consideration any heterogeneity of cellular composition as covariates for adjusting the associated response. Ultimately, the top 5 candidates that have the highest degrees of reQTL correlation with the most significant gene up-regulation after influenza infection will be deemed the fittest responders for our proposed mission.
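One possible way to turn this selection into a ranking is sketched below in Python. The scoring rule is an assumption made for illustration, since the proposal does not fix a formula: each candidate is credited with the positive log2 fold change of every Table 1 gene for which the candidate carries the corresponding reQTL variant. The adjustment for cellular composition mentioned above is left out, and the volunteer names and values are invented toy data.

```python
from typing import Dict, List, Tuple

# Genes associated with the five FLU-specific reQTLs listed in Table 1.
REQTL_GENES = ("ERAP2", "ADCY3", "CCDC109B", "IFNAR2", "IFNA21")

Candidate = Tuple[Dict[str, bool], Dict[str, float]]


def candidate_score(carries_variant: Dict[str, bool],
                    log2_fold_change: Dict[str, float]) -> float:
    """Sum the up-regulation of genes whose reQTL variant is present.

    Assumed scoring rule: down-regulated genes contribute nothing.
    """
    score = 0.0
    for gene in REQTL_GENES:
        if carries_variant.get(gene, False):
            score += max(log2_fold_change.get(gene, 0.0), 0.0)
    return score


def rank_candidates(candidates: Dict[str, Candidate],
                    top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n candidates by score, ties broken by name."""
    scored = [(name, candidate_score(variants, changes))
              for name, (variants, changes) in candidates.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Toy data, invented for illustration only.
toy_candidates = {
    "volunteer_A": ({"ERAP2": True, "IFNAR2": True},
                    {"ERAP2": 2.0, "IFNAR2": 1.5, "ADCY3": 3.0}),
    "volunteer_B": ({"ERAP2": True}, {"ERAP2": -1.0}),
    "volunteer_C": ({gene: True for gene in REQTL_GENES},
                    {gene: 1.0 for gene in REQTL_GENES}),
}
top_two = rank_candidates(toy_candidates, top_n=2)
```

In the toy data, ADCY3 is strongly induced in volunteer_A but does not count towards the score because the variant is absent, which reflects the requirement that genotype and expression response agree.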
Table 1: Top 5 FLU-specific reQTLs and the associated genes with the most significantly up-regulated expression in response specifically to influenza virus, as characterised by Lee et al. (1). The SNP/reQTL-associated genes were sorted based on M-value 0.9 and M-value 0.1 as inclusion and exclusion criteria, respectively, and on their significance levels, using data from supplementary table 4, sheet I. Delta Meta, from Lee et al. (1).
SNP ID         Gene      reQTL  FLU p-value   FLU M-value  LPS p-value   LPS M-value  IFN p-value
exm-rs1019503  ERAP2     TRUE   4.0217E-212   1            7.30441E-11   0            5.34953E-32
rs6752483      ADCY3     TRUE   4.92854E-24   1            0.000746454   0.081        0.559143
rs2285712      CCDC109B  TRUE   2.32421E-20   1            0.0845543     0            0.407339
rs2834160      IFNAR2    TRUE   4.09777E-20   1            0.027355      0.001        0.0984183
rs1477478      IFNA21    TRUE   4.28414E-18   1            0.351075      0            0.891976
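The selection described in the caption can be reproduced on the values of Table 1 with a short Python sketch. The direction of the comparisons is an assumption (FLU M-value of at least 0.9 for inclusion, LPS M-value of at most 0.1 for exclusion), only the LPS M-value is used for exclusion because the table lists no IFN M-value, and the function name flu_specific is hypothetical.

```python
from typing import List, Tuple

# (SNP ID, gene, FLU p-value, FLU M-value, LPS M-value), copied from Table 1.
Row = Tuple[str, str, float, float, float]

TABLE_1: List[Row] = [
    ("exm-rs1019503", "ERAP2", 4.0217e-212, 1.0, 0.0),
    ("rs6752483", "ADCY3", 4.92854e-24, 1.0, 0.081),
    ("rs2285712", "CCDC109B", 2.32421e-20, 1.0, 0.0),
    ("rs2834160", "IFNAR2", 4.09777e-20, 1.0, 0.001),
    ("rs1477478", "IFNA21", 4.28414e-18, 1.0, 0.0),
]


def flu_specific(rows: List[Row], include_m: float = 0.9,
                 exclude_m: float = 0.1) -> List[Row]:
    """Keep rows with an effect under FLU but not under LPS, most significant first."""
    selected = [row for row in rows
                if row[3] >= include_m and row[4] <= exclude_m]
    return sorted(selected, key=lambda row: row[2])


ranked_genes = [row[1] for row in flu_specific(TABLE_1)]
```

All five rows satisfy both criteria, and sorting by FLU p-value returns them in the order in which they appear in the table.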
References
1. M. N. Lee et al., Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science. 343, 1246980 (2014).
2. P. L. De Jager et al., ImmVar project: Insights and design considerations for future studies of “healthy” immune variation. Semin. Immunol. 27, 51–57 (2015).
3. D. Altshuler, M. J. Daly, E. S. Lander, Genetic mapping in human disease. Science. 322, 881–888 (2008).
4. W. G. Feero, A. E. Guttmacher, T. A. Manolio, Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med. 363, 166–176 (2010).
5. M. A. Schaub, A. P. Boyle, A. Kundaje, S. Batzoglou, M. Snyder, Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
6. I. Gat-Viks et al., Deciphering molecular circuits from genetic variation underlying transcriptional responsiveness to stimuli. Nat. Biotechnol. 31, 342–349 (2013).
7. G. Gibson, J. E. Powell, U. M. Marigorta, Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60 (2015).
8. B. P. Fairfax, J. C. Knight, Genetics of gene expression in immunity to infection. Curr Opin Immunol. 30, 63–71 (2014).
9. M. Çalışkan, S. W. Baker, Y. Gilad, C. Ober, Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet. 11, e1005111 (2015).
10. D. Chaussabel, Assessment of immune status using blood transcriptomics and potential implications for global health. Semin. Immunol. 27, 58–66 (2015).
11. H.-J. Westra, L. Franke, From genome to function by studying eQTLs. Biochim. Biophys. Acta. 1842, 1896–1902 (2014).
12. C. Trapnell et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 7, 562–578 (2012).
13. C. Trapnell et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
14. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (2013).
15. E. Garrison, G. Marth, Haplotype-based variant detection from short-read sequencing. arXiv (2012).
16. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
Annexed Figure 1
A - Tuxedo Differential Expression RNA-Seq Pipeline: flowchart extracted from (12) [figure not reproduced].
B - Genome re-sequencing assembly and variant calling workflow [flowchart, summarised as text]:
Inputs: paired reads (Fastq files); genome reference (hg19.fasta)
1) BWA-MEM, map genomic reads -> BAM files
2) Picard, add read groups -> RG labeled BAM files
3) Picard, merge BAM files -> files merged into one BAM file
4) Samtools, filter reads -> high quality BAM file
5) Picard, sort by coordinates -> sorted BAM file
6) Picard, remove duplicates -> dedup BAM file
7) FreeBayes, call variants -> VCF file
8) VCFlib, VCF filter -> high quality calls (VCF)
9) ANNOVAR, annotate variants -> VCF file with annotated high quality variants