FabioAmaralProject 3

Capstone Project for the Specialization in Systems Biology - August 2015
Fabio Amaral – fabioamaral@me.com
1. Project Goal

This project aims to set up an experimental and analytical workflow for classifying individuals by the predicted robustness of their innate immune response against influenza infection. A total of 10 physician volunteers will be evaluated, and the five top-ranked candidates will be selected to join a humanitarian mission in a region affected by an influenza outbreak. Each volunteer will be required to donate a blood sample, which will be used to assess the level of viral response via an integrated genomic and gene expression signature analysis based on the seminal work of Lee et al. (1), performed as part of the Phenogenetic Project and ImmVar Consortium (2).

2. Background

Genome-wide association studies (GWAS) have been effective in identifying common genetic variants which confer susceptibility to complex diseases and other phenotypes of interest (3). However, GWAS usually fail to pinpoint the causative mechanisms that lead to such traits. This is mainly due to variants often having small effect sizes, linkage disequilibrium between associated variants, little heritability and varied influences from the environment (4).

Interestingly, the vast majority of disease-associated single nucleotide polymorphisms (SNPs) map to the non-coding vicinity of genes, often affecting their level of expression (5). Variants that exert their regulatory effect on the steady-state levels of expression have been called expression Quantitative Trait Loci (eQTL), or response Quantitative Trait Loci (reQTL) when the effect on expression depends on a stimulus (6). These QTLs may manifest themselves in a tissue- or cell-type-specific manner (7).

Context-specific reQTLs are of extreme relevance in immunity to infection, since quantitative changes in gene expression alter the outcome of immune responses to a wide variety of perturbations such as infection with various pathogens and vaccination (8). Moreover, some viral infections (e.g. rhinovirus) can lead to a wide spectrum of symptom severity which is, at least in part, influenced by inter-individual reQTL variation among hosts (9).

Furthermore, different pathogen stimuli can affect either shared or stimulus-specific reQTL-associated genes, as shown in a recent study which assessed the effect of stimulating dendritic cells (DC) with bacterial lipopolysaccharide (LPS), influenza virus and interferon-β (1). Of the 121 common reQTLs (minor allele frequency >5%) identified in this study, only 7 loci had an effect on nearby genes specifically in response to influenza infection, which makes these reQTLs interesting parameters for assessing the effectiveness of the response to such viral infections.

Decisively, the profiling of blood transcriptomics integrated with genome-scale variant genotyping provides an attractive means for evaluating the immune status of individuals (10). Therefore we propose to test for the presence of the 5 most significant influenza-specific reQTLs (Table 1) and the up-regulation levels of the associated genes as a proxy for a robust immune response against influenza, in order to select the five physician candidates who would be best suited to attend a population affected by a flu outbreak.

3. Experimental approach:

In order to have comparable results we will use the experimental approach set by Lee et al. (1), with some modifications to account for current technological improvements, as described below.

3.1. Whole genome re-sequencing/variant calling: The peripheral blood mononuclear cells (PBMC) isolated from the volunteers' blood will be used for variant calling by next generation whole genome re-sequencing to obtain improved variant resolution. DCs will be derived from PBMCs and infected with influenza virus as described in the original Science report.

3.2. DC enrichment/single cell RNA-Seq: Since the observed reQTL effect sizes could vary considerably when evaluating DCs in bulk, due to heterogeneity in the composition of the cell population, we opted to perform a magnetic cell enrichment of the DC population with the Blood Dendritic Cell Isolation Kit II, human (Miltenyi Biotec). This results in an enriched cell fraction comprising plasmacytoid dendritic cells, CD1c (BDCA-1)+ type-1 myeloid dendritic cells (MDC1s), and CD1c (BDCA-1)- CD141 (BDCA-3) bright type-2 myeloid dendritic cells (MDC2s). This enriched DC fraction will have its cell content profiled via flow cytometry analysis followed by single cell RNA-Seq (2 x 96 cells per individual, resting vs. infected cell samples) using a Fluidigm platform, which will also help to improve reQTL analysis resolution (11).

4. Computational Analyses:

The computational analyses for the transcriptomic and genomic datasets are described below, and their respective flowcharts can be found in the annex at the end of this document.

4.1. Fastq files

Both the transcriptome and genome assembly pipelines use fastq format files as input. A fastq file is a text file which contains the raw information from each read coming out of a next generation sequencing experiment. Each read is represented as four text lines. The first line starts with an @ sign followed by the sequence identifier with some optional description. The second line contains the actual nucleotide sequence of the read. The third line contains only a + sign that marks the beginning of the nucleotide quality scores in the fourth line. Each nucleotide is associated with a Phred quality score that estimates its reliability. These quality scores are encoded as ASCII characters, and aligners usually assume the Sanger format encoding (Phred+33) by default, as this has been the standard format since Illumina 1.8.
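
The four-line record layout and the Phred+33 encoding described above can be illustrated with a minimal Python sketch. The function names (`decode_phred33`, `read_fastq`) and the example read are hypothetical and only serve the illustration; the sketch assumes each record occupies exactly four lines, with no wrapped sequences.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str        # header text after the leading '@'
    sequence: str          # nucleotide sequence (line two)
    qualities: List[int]   # Phred scores decoded from line four


def decode_phred33(quality_line: str) -> List[int]:
    """Convert ASCII quality characters to Phred scores (offset 33)."""
    return [ord(ch) - 33 for ch in quality_line]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, decode_phred33(quality))


# Hypothetical single-read example: 'I' encodes Phred 40, '5' Phred 20, '!' Phred 0.
example = io.StringIO("@read1 hypothetical example\nGATTACA\n+\nIIIII5!\n")
records = list(read_fastq(example))
```

In practice the aligners in the following sections read fastq files directly; a parser like this is mainly useful for quick quality checks before alignment.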

4.2. Differential Expression RNA-Seq (Tuxedo) Pipeline (View annexed figure 1-A)

1) TopHat - RNA-Seq alignment: High quality single-end RNA-Seq reads (fastq files as input) for all the single cell samples will be mapped to a human reference genome (hg19 build) using the TopHat aligner (13). TopHat is an alignment program specifically designed for the analysis of RNA-Seq data and is therefore able to map reads to the genome even when the reads span splice junctions whose genomic regions can be separated by relatively large intronic regions. TopHat produces the following output files for each sample aligned: a) align_summary.txt, b) insertions.bed, c) deletions.bed, d) splice_junctions.bed and e) accepted_hits.bam.

2) Cufflinks - Transcripts assembly: The mapped reads for the expressed genes and transcripts will be assembled for each sample (accepted_hits.bam files) using Cufflinks and a reference gene annotation (hg19.gtf) to estimate isoform expression. Cufflinks is both the name of a suite of tools and a program within that suite. The program Cufflinks assembles transcriptomes from RNA-Seq data and quantifies their expression, producing the following output files for each sample: a) gene_expression.tabular, b) transcript_expression.tabular, c) assembled_transcripts.gtf and d) skipped_transcripts.gtf.

3-4) Cuffmerge - Merge assemblies: A file called assemblies.txt that lists the assembly file for each sample (assembled_transcripts.gtf) will be created. Cuffmerge is then run on all the assemblies listed in assemblies.txt to create a single merged transcriptome annotation, using the reference gene annotation (hg19.gtf) and the reference genome sequence (hg19_genome.fasta). The output of Cuffmerge is a single GTF file that contains an assembly merging together all the input assemblies.

5) Cuffdiff - Differential expression inference: We will run Cuffdiff using as input the merged transcriptome assembly GTF file along with the BAM files (accepted_hits.bam) from TopHat for each sample. Cuffdiff is used for finding significant changes in transcript expression, splicing, and promoter use and produces 11 output files: a. Transcript FPKM (+count) expression tracking, b. Gene FPKM (+count) expression tracking, c. Primary transcript FPKM (+count) tracking, d. Coding sequence FPKM (+count) tracking, e. Transcript differential FPKM, f. Gene differential FPKM, g. Primary transcript differential FPKM, h. Coding sequence differential FPKM, i. Differential splicing tests, j. Differential promoter tests, k. Differential CDS tests.

6-18) Differential expression analysis: The differential expression analysis results will be explored with CummeRbund in the R environment. CummeRbund uses all the output files from Cuffdiff to create an SQLite database of results describing the relationships between genes, transcripts, transcription start sites and CDS regions. The stored and indexed data can be used for exploring sub-features of individual genes or gene sets and for plotting visualisations of the data.
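
As a rough sketch of how steps 1 to 5 chain together, the Python function below only builds the command lines (it does not run them). It assumes locally installed command-line versions of TopHat, Cufflinks, Cuffmerge and Cuffdiff; the output directory names, the choice of the `-g` (annotation-guided assembly) option for Cufflinks, and the condition labels are illustrative assumptions rather than settings taken from this proposal. Note that the command-line tools name some outputs differently from the labels used above (for example, Cufflinks writes `transcripts.gtf` into its output directory and Cuffmerge writes `merged.gtf`).

```python
from typing import Dict, List, Tuple


def tuxedo_commands(
    fastq_by_condition: Dict[str, List[str]],
    bowtie_index: str = "hg19",               # assumed index basename
    annotation: str = "hg19.gtf",
    genome_fasta: str = "hg19_genome.fasta",
) -> Tuple[List[List[str]], List[str]]:
    """Return (commands, assembly_paths) for a Tuxedo-style run."""
    commands: List[List[str]] = []
    assemblies: List[str] = []
    bams_by_condition: Dict[str, List[str]] = {}

    for condition, fastq_files in fastq_by_condition.items():
        bams_by_condition[condition] = []
        for fastq in fastq_files:
            sample = fastq.rsplit(".", 1)[0]
            tophat_dir = sample + "_thout"      # hypothetical directory names
            cufflinks_dir = sample + "_clout"
            bam = tophat_dir + "/accepted_hits.bam"
            # Step 1: spliced alignment of single-end reads.
            commands.append(["tophat", "-o", tophat_dir, bowtie_index, fastq])
            # Step 2: per-sample assembly guided by the reference annotation.
            commands.append(["cufflinks", "-g", annotation, "-o", cufflinks_dir, bam])
            bams_by_condition[condition].append(bam)
            assemblies.append(cufflinks_dir + "/transcripts.gtf")

    # Steps 3-4: assemblies.txt must list the paths in `assemblies`, one per line.
    commands.append(["cuffmerge", "-g", annotation, "-s", genome_fasta,
                     "-o", "merged_asm", "assemblies.txt"])

    # Step 5: one comma-separated BAM list per condition, in label order.
    labels = list(bams_by_condition)
    commands.append(["cuffdiff", "-o", "diff_out", "-L", ",".join(labels),
                     "merged_asm/merged.gtf"]
                    + [",".join(bams_by_condition[label]) for label in labels])
    return commands, assemblies


# Hypothetical two-sample example (one resting and one infected cell).
commands, assemblies = tuxedo_commands(
    {"resting": ["cell01_resting.fastq"], "infected": ["cell01_infected.fastq"]}
)
```

Each inner list can be passed to `subprocess.run(..., check=True)` once the tools and reference files are in place; building the lists first makes the pipeline easy to inspect before anything is executed.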

4.3. Genome Re-Sequencing Assembly and Variant Calling (View annexed figure 1-B)

1) BWA-MEM - Map genomic reads for each subject: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-MEM is the latest algorithm, designed for fast and accurate mapping of high quality Illumina sequence reads ranging from 70bp to 1Mbp (14). High quality paired-end reads in fastq file format are used as input together with a reference genome fasta file such as the human genome hg19 build. BWA-MEM produces a compressed binary BAM file as output.

2) Picard, AddOrReplaceReadGroups tool - Label read groups: We use this Picard tool function to label the reads from each sample in the BAM files before merging them for further analysis.

3) Picard, MergeSamFiles - Merge the read group labelled BAM files: MergeSamFiles is also a component of Picard and is used for merging multiple SAM/BAM files into one file.

4) Samtools, Filter reads: Samtools is used to retain only high quality mapped reads that are properly paired.

5) Picard, Paired Read Mate Fixer - Sort reads by coordinates: This Picard function can be used to adjust the ordering of reads.

6) Picard, MarkDuplicates - Remove all duplicated reads: We use the function MarkDuplicates to remove duplicate reads, which are amplification artefacts from library preparation.

7) FreeBayes - Call variants: FreeBayes (15) is a Bayesian genetic variant detector designed to find small polymorphisms (SNPs, indels, MNPs and complex events) smaller than the length of a short-read sequencing alignment. The processed short-read alignment BAM files with Phred+33 encoded quality scores and a reference genome in fasta format are used by FreeBayes to determine the most likely haplotype for the individuals at each position in the reference. The output of this variant caller is a variant call format (VCF) file that reports the positions which it finds putatively polymorphic.

8) VCFlib VCFfilter - Filter for high confidence and high coverage variant calls: We use VCFfilter to select only high confidence variant calls, based on a Phred-scaled quality score ("QUAL") threshold of 40 (false discovery rate (FDR) of 1 in 10,000) and sufficient read coverage depth ("DP" threshold of 5).

9) ANNOVAR - Annotate variants: Finally, ANNOVAR (16) is used to functionally annotate the genetic variants detected, identify variants that are documented in specific databases such as dbSNP, and report their allele frequencies based on the 1000 Genomes Project, NHLBI-ESP 6500 exomes or the Exome Aggregation Consortium.
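
To make the filtering in step 8 concrete, the following Python sketch shows the Phred-to-probability conversion and a plain-text filter over VCF records. It is not the VCFlib implementation; it assumes strict greater-than comparisons for both thresholds and that read depth is reported as `DP` in the INFO column. The function names and the example records are hypothetical.

```python
from typing import Iterable, Iterator


def phred_to_error_probability(qual: float) -> float:
    """Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-qual / 10)


def filter_vcf_lines(lines: Iterable[str],
                     min_qual: float = 40.0,
                     min_depth: int = 5) -> Iterator[str]:
    """Keep header lines and records with QUAL > min_qual and INFO DP > min_depth."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        fields = line.rstrip("\n").split("\t")
        qual_text, info_text = fields[5], fields[7]
        if qual_text == ".":  # missing quality: cannot pass the filter
            continue
        info = dict(item.split("=", 1) for item in info_text.split(";") if "=" in item)
        depth = int(info.get("DP", 0))
        if float(qual_text) > min_qual and depth > min_depth:
            yield line


# Hypothetical records: only the first passes both thresholds.
example_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chrN\t100\t.\tA\tG\t55.2\t.\tDP=12;AF=0.5\n",
    "chrN\t200\t.\tC\tT\t12.0\t.\tDP=30\n",
    "chrN\t300\t.\tG\tA\t80.0\t.\tDP=3\n",
]
kept = list(filter_vcf_lines(example_vcf))
```

A QUAL of 40 corresponds to an estimated probability of 10^(-4), i.e. 1 in 10,000, that the call is wrong, which matches the figure quoted in step 8.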

5. Expected Data

Interpretation of Data

We will search for the top 5 most significant FLU-specific reQTLs characterised by Lee et al. (1) within the variants identified from our 10 genotyped candidates. The presence of such variants will be correlated with a significant up-regulation of their associated genes in the single cell samples analysed after influenza infection in comparison to the unperturbed state (Table 1). For this correlation we will take into consideration any heterogeneity of cellular composition, using it as a covariate for adjusting the associated response. Ultimately, the top 5 candidates that have the highest degrees of reQTL correlation with the most significant gene up-regulations after influenza infection will be deemed the fittest responders for our proposed mission.
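
One way to picture the final ranking is the toy Python sketch below. The scoring rule is an assumption made purely for illustration and is not the statistical model of this proposal: for each candidate it sums the log2 fold-change (infected vs. resting expression, with a pseudocount) of every Table 1 gene whose reQTL the candidate carries, and it ignores the covariate adjustment for cellular composition. The function names and the example expression values are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

# SNP ID -> associated gene, as listed in Table 1.
REQTL_GENES: Dict[str, str] = {
    "exm-rs1019503": "ERAP2",
    "rs6752483": "ADCY3",
    "rs2285712": "CCDC109B",
    "rs2834160": "IFNAR2",
    "rs1477478": "IFNA21",
}

# (carried SNP IDs, resting expression by gene, infected expression by gene)
Candidate = Tuple[Set[str], Dict[str, float], Dict[str, float]]


def response_score(carried_snps: Set[str],
                   resting: Dict[str, float],
                   infected: Dict[str, float],
                   pseudocount: float = 1.0) -> float:
    """Sum log2 fold-changes over the reQTL genes whose variant is carried."""
    score = 0.0
    for snp, gene in REQTL_GENES.items():
        if snp in carried_snps:
            fold = (infected.get(gene, 0.0) + pseudocount) / (
                resting.get(gene, 0.0) + pseudocount)
            score += math.log2(fold)
    return score


def rank_candidates(candidates: Dict[str, Candidate],
                    top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n (name, score) pairs, highest score first."""
    scored = [(name, response_score(*data)) for name, data in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical example with made-up expression values.
demo = {
    "volunteer_A": ({"exm-rs1019503"}, {"ERAP2": 1.0}, {"ERAP2": 7.0}),
    "volunteer_B": (set(), {"ERAP2": 1.0}, {"ERAP2": 7.0}),
}
ranking = rank_candidates(demo, top_n=1)
```

A real analysis would replace the simple sum with a model that tests genotype-by-stimulus effects and includes cell composition as a covariate, as described above.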

Table 1: Top 5 FLU-specific reQTLs and the associated genes with the most significantly up-regulated expression specifically in response to influenza virus, as characterised by Lee et al. (1). The SNP/reQTL-associated genes were sorted using M-value thresholds of 0.9 and 0.1 as inclusion and exclusion criteria respectively, and by their significance levels, using data from supplementary table 4, sheet I. Delta Meta, from Lee et al. (1).

SNP ID         Gene      reQTL  FLU p-value   FLU M-value  LPS p-value   LPS M-value  IFN p-value
exm-rs1019503  ERAP2     TRUE   4.0217E-212   1            7.30441E-11   0            5.34953E-32
rs6752483      ADCY3     TRUE   4.92854E-24   1            0.000746454   0.081        0.559143
rs2285712      CCDC109B  TRUE   2.32421E-20   1            0.0845543     0            0.407339
rs2834160      IFNAR2    TRUE   4.09777E-20   1            0.027355      0.001        0.0984183
rs1477478      IFNA21    TRUE   4.28414E-18   1            0.351075      0            0.891976
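
The selection criteria in the Table 1 caption can be sketched in Python using the values from the table itself. The comparison directions (FLU M-value at or above 0.9 for inclusion, LPS M-value at or below 0.1 for exclusion) are assumptions, since the caption states only the thresholds; the source analysis may also have applied the exclusion criterion to the IFN M-value, which is not listed in the table. `ReqtlRow` and `select_flu_specific` are hypothetical names.

```python
from typing import List, NamedTuple


class ReqtlRow(NamedTuple):
    snp_id: str
    gene: str
    flu_p: float
    flu_m: float
    lps_p: float
    lps_m: float
    ifn_p: float


# Values copied from Table 1.
TABLE_1 = [
    ReqtlRow("exm-rs1019503", "ERAP2", 4.0217e-212, 1.0, 7.30441e-11, 0.0, 5.34953e-32),
    ReqtlRow("rs6752483", "ADCY3", 4.92854e-24, 1.0, 0.000746454, 0.081, 0.559143),
    ReqtlRow("rs2285712", "CCDC109B", 2.32421e-20, 1.0, 0.0845543, 0.0, 0.407339),
    ReqtlRow("rs2834160", "IFNAR2", 4.09777e-20, 1.0, 0.027355, 0.001, 0.0984183),
    ReqtlRow("rs1477478", "IFNA21", 4.28414e-18, 1.0, 0.351075, 0.0, 0.891976),
]


def select_flu_specific(rows: List[ReqtlRow],
                        include_m: float = 0.9,
                        exclude_m: float = 0.1,
                        top_n: int = 5) -> List[ReqtlRow]:
    """Keep rows with an effect under FLU but not LPS, most significant first."""
    kept = [row for row in rows
            if row.flu_m >= include_m and row.lps_m <= exclude_m]
    return sorted(kept, key=lambda row: row.flu_p)[:top_n]


selected = select_flu_specific(TABLE_1)
selected_genes = [row.gene for row in selected]
```

Applied to the full supplementary table rather than these five rows, the same filter-then-sort step would reproduce the kind of shortlist shown in Table 1, under the stated assumptions.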

References

1. M. N. Lee et al., Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science. 343, 1246980 (2014).

2. P. L. De Jager et al., ImmVar project: Insights and design considerations for future studies of “healthy” immune variation. Semin. Immunol. 27, 51–57 (2015).

3. D. Altshuler, M. J. Daly, E. S. Lander, Genetic mapping in human disease. Science. 322, 881–888 (2008).

4. W. G. Feero, A. E. Guttmacher, T. A. Manolio, Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med. 363, 166–176 (2010).

5. M. A. Schaub, A. P. Boyle, A. Kundaje, S. Batzoglou, M. Snyder, Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).

6. I. Gat-Viks et al., Deciphering molecular circuits from genetic variation underlying transcriptional responsiveness to stimuli. Nat. Biotechnol. 31, 342–349 (2013).

7. G. Gibson, J. E. Powell, U. M. Marigorta, Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60 (2015).

8. B. P. Fairfax, J. C. Knight, Genetics of gene expression in immunity to infection. Curr Opin Immunol. 30, 63–71 (2014).

9. M. Çalışkan, S. W. Baker, Y. Gilad, C. Ober, Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet. 11, e1005111 (2015).

10. D. Chaussabel, Assessment of immune status using blood transcriptomics and potential implications for global health. Semin. Immunol. 27, 58–66 (2015).

11. H.-J. Westra, L. Franke, From genome to function by studying eQTLs. Biochim. Biophys. Acta. 1842, 1896–1902 (2014).

12. C. Trapnell et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 7, 562–578 (2012).

13. C. Trapnell et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

14. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (2013).

15. E. Garrison, G. Marth, Haplotype-based variant detection from short-read sequencing. arXiv (2012).

16. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

Annexed Figure 1

A - Tuxedo Differential Expression RNA-Seq Pipeline flowchart, extracted from (12)

B - Genome re-sequencing assembly and variant calling workflow

[Flowchart. Inputs: paired reads (Fastq files) and the genome reference (hg19.fasta). Each step is listed with the file it produces.]
1) BWA-MEM, map genomic reads: BAM files
2) Picard, add read groups: RG labeled BAM files
3) Picard, merge BAM files: files merged into one BAM file
4) Samtools, filter reads: high quality BAM file
5) Picard, sort by coordinates: sorted BAM file
6) Picard, remove duplicates: dedup BAM file
7) FreeBayes, call variants: VCF file
8) VCFlib, VCF filter: high quality calls (VCF)
9) ANNOVAR, annotate variants: annotated high quality variants (VCF file)