Capstone Project for the Specialization in Systems Biology - August 2015
Fabio Amaral – fabioamaral@me.com
1. Project Goal

This project aims to set up an experimental and analytical workflow for classifying individuals based on the predicted robustness of their innate immune response against influenza infection. A total of 10 physician volunteers will be evaluated, and the top five ranked candidates will be selected to attend a humanitarian mission in a region affected by an influenza outbreak. The volunteers will be required to donate a blood sample, which will be used to assess the level of viral response via an integrated genomic and gene expression signature analysis based on the seminal work of Lee et al. (1), performed as part of the Phenogenetic Project and ImmVar Consortium (2).

2. Background

Genome-wide association studies (GWAS) have been effective in identifying common genetic variants that confer susceptibility to complex diseases and other phenotypes of interest (3). However, GWAS usually fail to pinpoint the causative mechanisms that lead to such traits. This is mainly due to variants often having small effect sizes, linkage disequilibrium between associated variants, low heritability and varied influences from the environment (4).

Interestingly, the vast majority of disease-associated single nucleotide polymorphisms (SNPs) map to the non-coding vicinity of genes, often affecting their level of expression (5). Variants that exert a regulatory effect on steady-state expression levels are called expression Quantitative Trait Loci (eQTL), or response Quantitative Trait Loci (reQTL) when the effect on expression depends on a stimulus (6), and these QTLs may manifest themselves in a tissue- or cell-type-specific manner (7).

Context-specific reQTLs are highly relevant in immunity to infection, since quantitative changes in gene expression alter the outcome of immune responses to a wide variety of perturbations, such as infection with various pathogens and vaccination (8). Moreover, some viral infections (e.g. rhinovirus) can lead to a wide spectrum of symptom severity, which is, at least in part, influenced by inter-individual reQTL variation among hosts (9).

Furthermore, different pathogen stimuli can affect shared or stimulus-specific reQTL-associated genes, as shown in a recent study that assessed the effect of stimulating dendritic cells (DC) with bacterial lipopolysaccharide (LPS), influenza virus and interferon-β (1). Of the 121 common reQTLs (minor allele frequency >5%) identified in that study, only 7 loci had an effect on nearby genes specifically in response to influenza infection, which makes these reQTLs interesting parameters for assessing the effectiveness of the response to such viral infections.

Crucially, profiling blood transcriptomics integrated with genome-scale variant genotyping provides an attractive means of evaluating the immune status of individuals (10). Therefore, we propose to test for the presence of the 5 most significant influenza-specific reQTLs (Table 1) and the up-regulation levels of the associated genes as a proxy for a robust immune response against influenza, in order to select the five physician candidates best suited to attend a population affected by a flu outbreak.

3. Experimental approach:

In order to obtain comparable results, we will use the experimental approach set by Lee et al. (1), with some modifications to account for current technological improvements, as described below.

3.1. Whole genome re-sequencing/variant calling: The peripheral blood mononuclear cells (PBMC) isolated from the volunteers' blood will be used for variant calling by next-generation whole genome re-sequencing to obtain improved variant resolution. DCs will be derived from PBMCs and infected with influenza virus as described in the original Science report.

3.2. DC enrichment/single cell RNA-Seq: Since the observed reQTL effect sizes could vary considerably when DCs are evaluated in bulk, owing to heterogeneity in the composition of the cell population, we opted to perform a magnetic cell enrichment of the DC population with the Blood Dendritic Cell Isolation Kit II, human (Miltenyi Biotec). This kit yields an enriched cell fraction comprising plasmacytoid dendritic cells, CD1c (BDCA-1)+ type-1 myeloid dendritic cells (MDC1s), and CD1c (BDCA-1)- CD141 (BDCA-3) bright type-2 myeloid dendritic cells (MDC2s). The cell content of this enriched DC fraction will be profiled via flow cytometry analysis, followed by single cell RNA-Seq (2 x 96 cells per individual, resting vs. infected cell samples) using a Fluidigm platform, which will also help to improve the resolution of the reQTL analysis (11).

4. Computational Analyses:

The computational analyses for the transcriptomic and genomic datasets are described below, and their respective flow charts can be found in the annex at the end of this document.

4.1. Fastq files

Both the transcriptome and genome assembly pipelines use fastq format files as input. A fastq file is a text file that contains the raw information for each read coming out of a next-generation sequencing experiment. Each read is represented as four text lines. The first line starts with an @ sign followed by the sequence identifier and an optional description. The second line contains the actual nucleotide sequence of the read. The third line contains only a + sign, which marks the beginning of the nucleotide quality scores in the fourth line. Each nucleotide is associated with a Phred quality score that estimates its reliability. These quality scores are encoded as ASCII characters, and aligners usually assume the Sanger format encoding (Phred+33) by default, as this has been the standard format since Illumina 1.8.
  
4.2. Differential Expression RNA-Seq (Tuxedo) Pipeline (View annexed figure 1-A)

1) TopHat - RNA-Seq alignment: High quality single-end RNA-seq reads (fastq files as input) for all the single cell samples will be mapped to a human reference genome (hg19 build) using the TopHat aligner (13). TopHat is an alignment program specifically designed for the analysis of RNA-Seq data and is therefore able to map reads to the genome even when the reads span splice junctions whose genomic regions can be separated by relatively large intronic regions. TopHat produces the following output files for each sample aligned: a) align_summary.txt, b) insertions.bed, c) deletions.bed, d) splice_junctions.bed and e) accepted_hits.bam.

2) Cufflinks - Transcripts assembly: The mapped reads for the expressed genes and transcripts will be assembled for each sample (accepted_hits.bam files) using Cufflinks and a reference gene annotation (hg19.gtf) to estimate isoform expression. Cufflinks is both the name of a suite of tools and a program within that suite. The program Cufflinks assembles transcriptomes from RNA-Seq data and quantifies their expression, producing the following output files for each sample: a) gene_expression.tabular, b) transcript_expression.tabular, c) assembled_transcripts.gtf and d) skipped_transcripts.gtf.

3-4) Cuffmerge - Merge assemblies: A file called assemblies.txt that lists the assembly file for each sample (assembled_transcripts.gtf) will be created. This file is used for running Cuffmerge on all the assemblies to create a single merged transcriptome annotation, using the references for gene annotation (hg19.gtf) and genomic regions (hg19_genome.fasta). The output of Cuffmerge is a single GTF file that contains an assembly merging together all the input assemblies.

5) Cuffdiff - Differential expression inference: We will run Cuffdiff using as input the merged transcriptome assembly GTF file along with the BAM files (accepted_hits.bam) from TopHat for each sample. Cuffdiff is used for finding significant changes in transcript expression, splicing and promoter use, and produces 11 output files: a. Transcript FPKM (+count) expression tracking, b. Gene FPKM (+count) expression tracking, c. Primary transcript FPKM (+count) tracking, d. Coding sequence FPKM (+count) tracking, e. Transcript differential FPKM, f. Gene differential FPKM, g. Primary transcript differential FPKM, h. Coding sequence differential FPKM, i. Differential splicing tests, j. Differential promoter tests, k. Differential CDS tests.

6-18) Differential expression analysis: The differential expression analysis results will be explored with CummeRbund in the R environment. CummeRbund uses all the output files from Cuffdiff to create a SQLite database of results describing the relationships between genes, transcripts, transcription start sites and CDS regions. The stored and indexed data can be used for exploring sub-features of individual genes or gene sets and for plotting visualisations of the data.
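
The order in which the Tuxedo steps above hand files to one another can be sketched as a list of command lines. This is a minimal sketch that only builds the argument lists and does not run anything; the sample names, fastq paths, Bowtie index name and output directory names are hypothetical, the choice of the `-g` (annotation-guided assembly) option for Cufflinks is an assumption, and the file names follow the command-line defaults (`transcripts.gtf`, `merged.gtf`) rather than the labels listed in the step descriptions.

```python
from typing import Dict, List


def assemblies_manifest(sample_names: List[str]) -> str:
    """Text of assemblies.txt: one Cufflinks assembly path per sample."""
    return "\n".join(f"{name}_clout/transcripts.gtf" for name in sample_names) + "\n"


def build_tuxedo_commands(
    fastq_by_sample: Dict[str, str],
    samples_by_condition: Dict[str, List[str]],
    bowtie_index: str,
    annotation_gtf: str,
    genome_fasta: str,
) -> List[List[str]]:
    """Return TopHat, Cufflinks, Cuffmerge and Cuffdiff command lines in order."""
    commands: List[List[str]] = []
    for sample, fastq in fastq_by_sample.items():
        # Step 1: spliced alignment of the reads for this sample.
        commands.append(["tophat", "-G", annotation_gtf,
                         "-o", f"{sample}_thout", bowtie_index, fastq])
        # Step 2: transcript assembly from the aligned reads.
        commands.append(["cufflinks", "-g", annotation_gtf,
                         "-o", f"{sample}_clout",
                         f"{sample}_thout/accepted_hits.bam"])
    # Steps 3-4: merge all per-sample assemblies listed in assemblies.txt.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-o", "merged_asm", "assemblies.txt"])
    # Step 5: one comma-separated group of BAM files per condition.
    labels = ",".join(samples_by_condition)
    bam_groups = [
        ",".join(f"{sample}_thout/accepted_hits.bam" for sample in samples)
        for samples in samples_by_condition.values()
    ]
    commands.append(["cuffdiff", "-o", "diff_out", "-L", labels,
                     "merged_asm/merged.gtf", *bam_groups])
    return commands


# Hypothetical example: two resting and two infected cells from one volunteer.
fastqs = {"rest_1": "rest_1.fastq", "rest_2": "rest_2.fastq",
          "flu_1": "flu_1.fastq", "flu_2": "flu_2.fastq"}
conditions = {"resting": ["rest_1", "rest_2"], "infected": ["flu_1", "flu_2"]}
commands = build_tuxedo_commands(fastqs, conditions, "hg19", "hg19.gtf",
                                 "hg19_genome.fasta")
manifest = assemblies_manifest(list(fastqs))
```

Keeping the commands as data makes it easy to review the full plan for 2 x 96 cells per individual before anything is submitted to a scheduler.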
  	
  
4.3. Genome Re-Sequencing Assembly and Variant Calling (View annexed figure 1-B)

1) BWA-MEM - Map genomic reads for each subject: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-MEM is the latest BWA algorithm, designed for fast and accurate mapping of high quality Illumina sequence reads ranging from 70bp to 1Mbp (14). High quality paired-end reads in fastq file format are used as input, together with a reference genome fasta file such as the human genome hg19 build. BWA-MEM produces a compressed binary BAM file as output.

2) Picard, AddOrReplaceReadGroups tool - Label read groups: We use this Picard tool to label the reads from each sample in the BAM files before merging them for further analysis.

3) Picard, MergeSamFiles - Merge the read group labelled BAM files: MergeSamFiles is also a component of Picard and is used for merging multiple SAM/BAM files into one file.

4) Samtools, Filter reads: Samtools is used to retain only high quality mapped and properly paired reads.

5) Picard, Paired Read Mate Fixer - Sort reads by coordinates: This Picard function is used to adjust the ordering of the reads so that they are sorted by coordinate.

6) Picard, MarkDuplicates - Remove all duplicated reads: We use the MarkDuplicates function to remove duplicates, which are amplification artefacts from library preparation.

7) FreeBayes - Call variants: FreeBayes (15) is a Bayesian genetic variant detector designed to find small polymorphisms (SNPs, indels, MNPs and complex events) smaller than the length of a short-read sequencing alignment. The processed short-read alignment BAM files, with Phred+33 encoded quality scores, and a reference genome in fasta format are used by FreeBayes to determine the most likely haplotype for the individuals at each position in the reference. The output of this variant caller is a variant call format (VCF) file that reports the positions it finds putatively polymorphic.

8) VCFlib VCFfilter - Filter for high confidence and high coverage variant calls: We use VCFfilter to select only high confidence variant calls based on the Phred score "QUAL ≥ 40" (false discovery rate (FDR) of 1 in 10,000) and sufficient read coverage depth "DP ≥ 5".

9) ANNOVAR - Annotate variants: Finally, ANNOVAR (16) is used to functionally annotate the detected genetic variants, to identify variants that are documented in specific databases such as dbSNP, and to report their allele frequencies based on the 1000 Genomes Project, NHLBI-ESP 6500 exomes or the Exome Aggregation Consortium.
5. Expected Data and Interpretation of Data

We will search for the top 5 most significant FLU-specific reQTLs characterised by Lee et al. (1) among the variants identified from our 10 genotyped candidates. The presence of such variants will be correlated with a significant up-regulation of their associated genes in the single cell samples analysed after influenza infection, in comparison to the unperturbed state (Table 1). For this correlation we will take into consideration any heterogeneity of cellular composition, using it as a covariate for adjusting the associated response. Ultimately, the top 5 candidates that have the highest degrees of reQTL correlation with the most significant gene up-regulation after influenza infection will be deemed the fittest responders for our proposed mission.
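
One way the final ranking could be organised is sketched below. The proposal does not define a scoring formula, so the score used here (the summed log2 fold change of significantly up-regulated genes whose reQTL the candidate carries) is purely an illustrative assumption, as are the function names and the toy candidates; the covariate adjustment for cellular composition is not modelled.

```python
from typing import Dict, List, Set, Tuple

# SNP-to-gene pairs taken from Table 1.
REQTL_GENES: Dict[str, str] = {
    "exm-rs1019503": "ERAP2",
    "rs6752483": "ADCY3",
    "rs2285712": "CCDC109B",
    "rs2834160": "IFNAR2",
    "rs1477478": "IFNA21",
}


def candidate_score(
    carried_snps: Set[str],
    log2_fold_change: Dict[str, float],
    significant_genes: Set[str],
) -> float:
    """Sum the positive log2 fold changes of significant genes whose reQTL is carried."""
    score = 0.0
    for snp, gene in REQTL_GENES.items():
        if snp in carried_snps and gene in significant_genes:
            score += max(log2_fold_change.get(gene, 0.0), 0.0)
    return score


def rank_candidates(
    candidates: Dict[str, Tuple[Set[str], Dict[str, float], Set[str]]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_n candidates ordered by decreasing score."""
    scored = [(name, candidate_score(*data)) for name, data in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy data for three hypothetical volunteers (not real measurements).
toy_candidates = {
    "A": ({"exm-rs1019503", "rs2834160"}, {"ERAP2": 2.0, "IFNAR2": 1.5},
          {"ERAP2", "IFNAR2"}),
    "B": ({"rs6752483"}, {"ADCY3": 3.0}, set()),
    "C": ({"rs1477478"}, {"IFNA21": 1.0}, {"IFNA21"}),
}
ranking = rank_candidates(toy_candidates, top_n=2)
```

Candidate B carries a reQTL but shows no significant up-regulation of the associated gene, so it scores zero under this assumed rule.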
  

Table 1: Top 5 FLU-specific reQTLs and the associated genes with the most significantly up-regulated expression specifically in response to influenza virus, as characterised by Lee et al. (1). The SNP/reQTL-associated genes were sorted based on M-value ≥ 0.9 and M-value ≤ 0.1 inclusion and exclusion criteria, respectively, and on their significance levels, using data from supplementary table 4, sheet I. Delta Meta, from Lee et al. (1).

SNP ID         Gene      reQTL  FLU p-value   FLU M-value  LPS p-value   LPS M-value  IFN p-value
exm-rs1019503  ERAP2     TRUE   4.0217E-212   1            7.30441E-11   0            5.34953E-32
rs6752483      ADCY3     TRUE   4.92854E-24   1            0.000746454   0.081        0.559143
rs2285712      CCDC109B  TRUE   2.32421E-20   1            0.0845543     0            0.407339
rs2834160      IFNAR2    TRUE   4.09777E-20   1            0.027355      0.001        0.0984183
rs1477478      IFNA21    TRUE   4.28414E-18   1            0.351075      0            0.891976
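
The inclusion and exclusion criteria in the caption of Table 1 can be expressed as a short filter-and-sort sketch over the tabulated values. Applying the inclusion threshold to the FLU M-value and the exclusion threshold to the LPS M-value is an assumption about how the criteria were applied, and the function name is hypothetical.

```python
from typing import List, NamedTuple


class ReqtlRow(NamedTuple):
    snp_id: str
    gene: str
    flu_p: float
    flu_m: float
    lps_p: float
    lps_m: float
    ifn_p: float


# Values copied from Table 1.
TABLE_1 = [
    ReqtlRow("exm-rs1019503", "ERAP2", 4.0217e-212, 1.0, 7.30441e-11, 0.0, 5.34953e-32),
    ReqtlRow("rs6752483", "ADCY3", 4.92854e-24, 1.0, 0.000746454, 0.081, 0.559143),
    ReqtlRow("rs2285712", "CCDC109B", 2.32421e-20, 1.0, 0.0845543, 0.0, 0.407339),
    ReqtlRow("rs2834160", "IFNAR2", 4.09777e-20, 1.0, 0.027355, 0.001, 0.0984183),
    ReqtlRow("rs1477478", "IFNA21", 4.28414e-18, 1.0, 0.351075, 0.0, 0.891976),
]


def select_flu_specific(rows: List[ReqtlRow], include_m: float = 0.9,
                        exclude_m: float = 0.1) -> List[ReqtlRow]:
    """Keep rows with a high FLU M-value and a low LPS M-value, sorted by FLU p-value."""
    kept = [row for row in rows
            if row.flu_m >= include_m and row.lps_m <= exclude_m]
    return sorted(kept, key=lambda row: row.flu_p)


selected_genes = [row.gene for row in select_flu_specific(TABLE_1)]
```

All five rows satisfy both thresholds under this reading, and sorting by FLU p-value reproduces the row order shown in Table 1.
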
References

1. M. N. Lee et al., Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science. 343, 1246980 (2014).
2. P. L. De Jager et al., ImmVar project: Insights and design considerations for future studies of "healthy" immune variation. Semin. Immunol. 27, 51–57 (2015).
3. D. Altshuler, M. J. Daly, E. S. Lander, Genetic mapping in human disease. Science. 322, 881–888 (2008).
4. W. G. Feero, A. E. Guttmacher, T. A. Manolio, Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med. 363, 166–176 (2010).
5. M. A. Schaub, A. P. Boyle, A. Kundaje, S. Batzoglou, M. Snyder, Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
6. I. Gat-Viks et al., Deciphering molecular circuits from genetic variation underlying transcriptional responsiveness to stimuli. Nat. Biotechnol. 31, 342–349 (2013).
7. G. Gibson, J. E. Powell, U. M. Marigorta, Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60 (2015).
8. B. P. Fairfax, J. C. Knight, Genetics of gene expression in immunity to infection. Curr Opin Immunol. 30, 63–71 (2014).
9. M. Çalışkan, S. W. Baker, Y. Gilad, C. Ober, Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet. 11, e1005111 (2015).
10. D. Chaussabel, Assessment of immune status using blood transcriptomics and potential implications for global health. Semin. Immunol. 27, 58–66 (2015).
11. H.-J. Westra, L. Franke, From genome to function by studying eQTLs. Biochim. Biophys. Acta. 1842, 1896–1902 (2014).
12. C. Trapnell et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 7, 562–578 (2012).
13. C. Trapnell et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
14. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (2013).
15. E. Garrison, G. Marth, Haplotype-based variant detection from short-read sequencing. arXiv (2012).
16. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).

Annexed Figure 1

A - Tuxedo Differential Expression RNA-Seq Pipeline flowchart, extracted from (12)

B - Genome re-sequencing assembly and variant calling workflow

Input: Fastq files (paired reads) and genome reference (hg19.fasta)
1) BWA-MEM, map genomic reads: BAM files
2) Picard, add read groups: RG labeled BAM files
3) Picard, merge BAM files: files merged into one BAM file
4) Samtools, filter reads: high quality BAM file
5) Picard, sort by coordinates: sorted BAM file
6) Picard, remove duplicates: dedup BAM file
7) FreeBayes, call variants: VCF file
8) VCFlib, VCF filter: high quality calls (VCF)
9) ANNOVAR, annotate variants: annotated high quality variants
6

More Related Content

What's hot

Purdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINALPurdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
Christy Cooper
 
Jc Rethinking Of Hsc Assays
Jc Rethinking Of Hsc AssaysJc Rethinking Of Hsc Assays
Jc Rethinking Of Hsc Assays
nanog
 
Array_nmeth.3507
Array_nmeth.3507Array_nmeth.3507
Array_nmeth.3507
Aana Hahn
 
Louis Chavez Senior Thesis
Louis Chavez Senior ThesisLouis Chavez Senior Thesis
Louis Chavez Senior Thesis
Louis Chavez
 
Guidelines and techniques for iPSC
Guidelines and techniques for iPSCGuidelines and techniques for iPSC
Guidelines and techniques for iPSC
lihuaibei
 
zandona14nipsA0
zandona14nipsA0zandona14nipsA0
zandona14nipsA0
Pia Sen
 
New insights into leukemic niche in bone marrow
New insights into leukemic niche in bone marrowNew insights into leukemic niche in bone marrow
New insights into leukemic niche in bone marrow
nanog
 
AJP_12-0313_Araten_et_al_Word_Version
AJP_12-0313_Araten_et_al_Word_VersionAJP_12-0313_Araten_et_al_Word_Version
AJP_12-0313_Araten_et_al_Word_Version
Jonathan Karten
 
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
Karolina Megiel
 
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
JR Matthews
 

What's hot (20)

Systemic analysis of data combined from genetic qtl's and gene expression dat...
Systemic analysis of data combined from genetic qtl's and gene expression dat...Systemic analysis of data combined from genetic qtl's and gene expression dat...
Systemic analysis of data combined from genetic qtl's and gene expression dat...
 
FA abstract
FA abstractFA abstract
FA abstract
 
JClinChem_2003
JClinChem_2003JClinChem_2003
JClinChem_2003
 
The Effects of Genetic Alteration on Reprogramming of Fibroblasts into Induc...
The Effects of Genetic Alteration on Reprogramming of  Fibroblasts into Induc...The Effects of Genetic Alteration on Reprogramming of  Fibroblasts into Induc...
The Effects of Genetic Alteration on Reprogramming of Fibroblasts into Induc...
 
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINALPurdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
 
Jc Rethinking Of Hsc Assays
Jc Rethinking Of Hsc AssaysJc Rethinking Of Hsc Assays
Jc Rethinking Of Hsc Assays
 
Array_nmeth.3507
Array_nmeth.3507Array_nmeth.3507
Array_nmeth.3507
 
Louis Chavez Senior Thesis
Louis Chavez Senior ThesisLouis Chavez Senior Thesis
Louis Chavez Senior Thesis
 
Guidelines and techniques for iPSC
Guidelines and techniques for iPSCGuidelines and techniques for iPSC
Guidelines and techniques for iPSC
 
14KoVar
14KoVar14KoVar
14KoVar
 
zandona14nipsA0
zandona14nipsA0zandona14nipsA0
zandona14nipsA0
 
New insights into leukemic niche in bone marrow
New insights into leukemic niche in bone marrowNew insights into leukemic niche in bone marrow
New insights into leukemic niche in bone marrow
 
vineeta poster 2
vineeta  poster  2vineeta  poster  2
vineeta poster 2
 
AJP_12-0313_Araten_et_al_Word_Version
AJP_12-0313_Araten_et_al_Word_VersionAJP_12-0313_Araten_et_al_Word_Version
AJP_12-0313_Araten_et_al_Word_Version
 
Sortase Paper
Sortase PaperSortase Paper
Sortase Paper
 
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
JTM-Functional characterization of human Cd33+ And Cd11b+ myeloid-derived sup...
 
MCR_Article_JW
MCR_Article_JWMCR_Article_JW
MCR_Article_JW
 
Use of Methylation Markers for Age Estimation of an unknown Individual based ...
Use of Methylation Markers for Age Estimation of an unknown Individual based ...Use of Methylation Markers for Age Estimation of an unknown Individual based ...
Use of Methylation Markers for Age Estimation of an unknown Individual based ...
 
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
Delineating Recombination Frequency Between Methicillin Resistant and Suscept...
 
1506.full
1506.full1506.full
1506.full
 

Viewers also liked

Viewers also liked (7)

Ryan Green-Resume
Ryan Green-ResumeRyan Green-Resume
Ryan Green-Resume
 
Thelma M. Holt-Nicholson Resume
Thelma M. Holt-Nicholson ResumeThelma M. Holt-Nicholson Resume
Thelma M. Holt-Nicholson Resume
 
NCBI presentation 27th April 2016
NCBI presentation 27th April 2016NCBI presentation 27th April 2016
NCBI presentation 27th April 2016
 
melding digital asset management
melding digital asset managementmelding digital asset management
melding digital asset management
 
#2 Resume Sherry Powers - Kentucky
#2 Resume Sherry Powers - Kentucky#2 Resume Sherry Powers - Kentucky
#2 Resume Sherry Powers - Kentucky
 
Himanshu_Oracle_DBA_Resume
Himanshu_Oracle_DBA_ResumeHimanshu_Oracle_DBA_Resume
Himanshu_Oracle_DBA_Resume
 
Health insurance in India- Dr Suraj Chawla
Health insurance in India- Dr Suraj ChawlaHealth insurance in India- Dr Suraj Chawla
Health insurance in India- Dr Suraj Chawla
 

Similar to FabioAmaralProject 3

Pharmacology Powered by Computational Analysis: Predicting Cardiotoxicity of ...
Pharmacology Powered by Computational Analysis: Predicting Cardiotoxicity of ...Pharmacology Powered by Computational Analysis: Predicting Cardiotoxicity of ...
Pharmacology Powered by Computational Analysis: Predicting Cardiotoxicity of ...
New York City College of Technology Computer Systems Technology Colloquium
 

FabioAmaralProject 3

Capstone Project for the Specialization in Systems Biology - August 2015
Fabio Amaral – fabioamaral@me.com

1. Project Goal
This project aims to set up an experimental and analytical workflow for classifying individuals based on the predicted robustness of their innate immune response against influenza infection. A total of 10 physician volunteers will be evaluated, and the top five ranked candidates will be selected to attend a humanitarian mission in a region affected by an influenza outbreak. The volunteers will be required to donate a blood sample, which will be used to assess the level of viral response via an integrated genomic and gene expression signature analysis based on the seminal work of Lee et al. (1), performed as part of the Phenogenetic Project and ImmVar Consortium (2).

2. Background
Genome-wide association studies (GWAS) have been effective in identifying common genetic variants which confer susceptibility to complex diseases and other phenotypes of interest (3). However, GWAS usually fail to pinpoint the causative mechanisms that lead to such traits. This is mainly due to variants often having small effect sizes, linkage disequilibrium between associated variants, low heritability and varied influences from the environment (4).
Interestingly, the vast majority of disease-associated single nucleotide polymorphisms (SNPs) map to the non-coding vicinity of genes, often affecting their level of expression (5).
Variants that exert a regulatory effect on steady-state expression levels are called expression quantitative trait loci (eQTLs), or response quantitative trait loci (reQTLs) when the effect on expression depends on a stimulus (6). These QTLs may manifest themselves in a tissue- or cell-type-specific manner (7).
Context-specific reQTLs are highly relevant in immunity to infection, since quantitative changes in gene expression alter the outcome of immune responses to a wide variety of perturbations, such as infection with various pathogens and vaccination (8). Moreover, some viral infections (e.g. rhinovirus) can lead to a wide spectrum of symptom severity which is, at least in part, influenced by inter-individual reQTL variation among hosts (9).
Furthermore, different pathogen stimuli can affect either shared or stimulus-specific reQTL-associated genes, as shown in a recent study which assessed the effect of stimulating dendritic cells (DCs) with bacterial lipopolysaccharide (LPS), influenza virus and interferon-β (1). Of the 121 common reQTLs (minor allele frequency >5%) identified in this study, only 7 loci had an effect on nearby genes specifically in response to influenza infection, which makes these reQTLs interesting parameters for assessing the effectiveness of the response to such viral infections.
Importantly, the profiling of blood transcriptomics integrated with genome-scale variant genotyping provides an attractive means of evaluating the immune status of individuals (10).
Therefore we propose to test for the presence of the 5 most significant influenza-specific reQTLs (Table 1) and the up-regulation levels of the associated genes as a proxy for a robust immune response against influenza, in order to select the five physician candidates who would be best suited to attend a population affected by a flu outbreak.

3. Experimental approach
In order to obtain comparable results we will use the experimental approach set by Lee et al. (1), with some modifications to account for current technological improvements, as described below.
3.1. Whole genome re-sequencing/variant calling: The peripheral blood mononuclear cells (PBMCs) isolated from the volunteers' blood will be used for variant calling by next generation whole genome re-sequencing to obtain improved variant resolution. DCs will be derived from PBMCs and infected with influenza virus as described in the original Science report.

3.2. DC enrichment/single cell RNA-Seq: Since the observed reQTL effect sizes could vary considerably when evaluating DCs in bulk, owing to heterogeneity in the composition of the cell population, we opted to perform a magnetic cell enrichment of the DC population with the Blood Dendritic Cell Isolation Kit II, human (Miltenyi Biotec). This yields an enriched cell fraction comprising plasmacytoid dendritic cells, CD1c (BDCA-1)+ type-1 myeloid dendritic cells (MDC1s), and CD1c (BDCA-1)- CD141 (BDCA-3) bright type-2 myeloid dendritic cells (MDC2s). The cell content of this enriched DC fraction will be profiled by flow cytometry, followed by single cell RNA-Seq (2 x 96 cells per individual, resting vs. infected cell samples) using a Fluidigm platform, which will also help to improve the resolution of the reQTL analysis (11).

4. Computational Analyses
The computational analyses for the transcriptomic and genomic datasets are described below, and their respective flow charts can be found in the annex at the end of this document.

4.1. Fastq files
Both the transcriptome and genome assembly pipelines use fastq format files as input.
A fastq file is a text file which contains the raw information for each read coming out of a next generation sequencing experiment. Each read is represented as four text lines. The first line starts with an @ sign followed by the sequence identifier and an optional description. The second line contains the actual nucleotide sequence of the read. The third line contains a + sign, which marks the start of the nucleotide quality scores given in the fourth line. Each nucleotide is associated with a Phred quality score that estimates its reliability. These quality scores are encoded as ASCII characters, and aligners usually assume the Sanger (Phred+33) encoding by default, as this has been the standard format since Illumina 1.8.

4.2. Differential Expression RNA-Seq (Tuxedo) Pipeline (see annexed figure 1-A)
1) TopHat - RNA-Seq alignment: High quality single-end RNA-Seq reads (fastq files as input) for all the single cell samples will be mapped to a human reference genome (hg19 build) using the TopHat aligner (13). TopHat is an alignment program specifically designed for the analysis of RNA-Seq data and is therefore able to map reads to the genome even when they span splice junctions whose genomic regions are separated by relatively large introns. TopHat produces the following output files for each aligned sample: a) align_summary.txt, b) insertions.bed, c) deletions.bed, d) splice_junctions.bed and e) accepted_hits.bam.
2) Cufflinks - Transcript assembly: The mapped reads for the expressed genes and transcripts will be assembled for each sample (accepted_hits.bam files) using Cufflinks and a reference gene annotation (hg19.gtf) to estimate isoform expression.
Cufflinks is both the name of a suite of tools and a program within that suite. The program Cufflinks assembles transcriptomes from RNA-Seq data and quantifies their expression, producing the following output files for each sample: a) gene_expression.tabular, b) transcript_expression.tabular, c) assembled_transcripts.gtf and d) skipped_transcripts.gtf.
3-4) Cuffmerge - Merge assemblies: A file called assemblies.txt, which lists the assembly file for each sample (assembled_transcripts.gtf), will be created. Cuffmerge is then run on this list to create a single merged transcriptome annotation, using the reference gene annotation (hg19.gtf) and the reference genome sequence (hg19_genome.fasta). The output of Cuffmerge is a single GTF file containing an assembly that merges together all the input assemblies.
5) Cuffdiff - Differential expression inference: We will run Cuffdiff using as input the merged transcriptome assembly GTF file along with the BAM files (accepted_hits.bam) from TopHat for each sample. Cuffdiff is used for finding significant changes in transcript expression, splicing and promoter use, and produces 11 output files: a. Transcript FPKM (+count) expression tracking, b. Gene FPKM (+count) expression tracking, c. Primary transcript FPKM (+count) tracking, d. Coding sequence FPKM (+count) tracking, e. Transcript differential FPKM, f. Gene differential FPKM, g. Primary transcript differential FPKM, h. Coding sequence differential FPKM, i. Differential splicing tests, j. Differential promoter tests, k. Differential CDS tests.
6-18) Differential expression analysis: The differential expression results will be explored with CummeRbund in the R environment. CummeRbund uses all the output files from Cuffdiff to create a SQLite database of results describing the relationships between genes, transcripts, transcription start sites and CDS regions.
The stored and indexed data can be used to explore sub-features of individual genes or gene sets and to plot visualisations of the data.

4.3. Genome Re-Sequencing Assembly and Variant Calling (see annexed figure 1-B)
1) BWA-MEM - Map genomic reads for each subject: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-MEM is its latest algorithm, designed for fast and accurate mapping of high quality Illumina sequence reads ranging from 70bp to 1Mbp (14). High quality paired-end reads in fastq format are used as input, together with a reference genome fasta file such as the human genome hg19 build. The alignments produced by BWA-MEM are stored as a compressed binary BAM file.
2) Picard, AddOrReplaceReadGroups tool - Label read groups: We use this Picard tool to label the reads from each sample in the BAM files before merging them for further analysis.
3) Picard, MergeSamFiles - Merge the read group labelled BAM files: MergeSamFiles is also a component of Picard and is used for merging multiple SAM/BAM files into one file.
4) Samtools - Filter reads: Samtools is used to retain only high quality mapped, properly paired reads.
5) Picard, Paired Read Mate Fixer - Sort reads by coordinates: This Picard function can be used to adjust the ordering of reads.
6) Picard, MarkDuplicates - Remove all duplicated reads: We use MarkDuplicates to remove duplicates, which are amplification artefacts from library preparation.
7) FreeBayes - Call variants: FreeBayes (15) is a Bayesian genetic variant detector designed to find small polymorphisms (SNPs, indels, MNPs and complex events) smaller than the length of a short-read sequencing alignment. The processed short-read alignment BAM files with Phred+33 encoded quality scores and a reference genome in fasta format are used by FreeBayes to determine the most likely haplotype for the individuals at each position in the reference. The output of this variant caller is a variant call format (VCF) file that reports the positions which it finds putatively polymorphic.
8) VCFlib VCFfilter - Filter for high confidence and high coverage variant calls: We use VCFfilter to select only high confidence variant calls based on Phred score "QUAL > 40" (corresponding to an estimated probability of 1 in 10,000 that the call is wrong) and sufficient read coverage depth "DP > 5".
9) ANNOVAR - Annotate variants: Finally, ANNOVAR (16) is used to functionally annotate the detected genetic variants, to identify variants that are documented in specific databases such as dbSNP, and to report their allele frequencies based on the 1000 Genomes Project, the NHLBI-ESP 6500 exomes or the Exome Aggregation Consortium.

5. Expected Data and Interpretation of Data
We will search for the top 5 most significant FLU-specific reQTLs characterised by Lee et al. (1) among the variants identified in our 10 genotyped candidates. The presence of such variants will be correlated with a significant up-regulation of their associated genes in the single cell samples analysed after influenza infection in comparison to the unperturbed state (Table 1). For this correlation we will take into consideration any heterogeneity of cellular composition, using it as a covariate to adjust the associated response.
Ultimately, the top 5 candidates that have the highest degree of reQTL correlation with the most significant gene up-regulation after influenza infection will be deemed the fittest responders for our proposed mission.

Table 1: Top 5 FLU-specific reQTLs and their associated genes, with the most significantly up-regulated expression specifically in response to influenza virus, as characterised by Lee et al. (1). The SNP/reQTL-associated genes were sorted using M-value > 0.9 and M-value < 0.1 as inclusion and exclusion criteria, respectively, and by their significance levels, using data from supplementary table 4, sheet I. Delta Meta, of Lee et al. (1).

SNP ID         Gene      reQTL  FLU p-value   FLU M-value  LPS p-value   LPS M-value  IFN p-value
exm-rs1019503  ERAP2     TRUE   4.0217E-212   1            7.30441E-11   0            5.34953E-32
rs6752483      ADCY3     TRUE   4.92854E-24   1            0.000746454   0.081        0.559143
rs2285712      CCDC109B  TRUE   2.32421E-20   1            0.0845543     0            0.407339
rs2834160      IFNAR2    TRUE   4.09777E-20   1            0.027355      0.001        0.0984183
rs1477478      IFNA21    TRUE   4.28414E-18   1            0.351075      0            0.891976
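The search for the Table 1 reQTLs among a candidate's filtered variant calls can be sketched as follows. This is only an illustration under stated assumptions: it assumes the VCF ID column already carries the SNP identifiers exactly as written in Table 1, that read depth is available as DP in the INFO column, and that the thresholds are the QUAL and DP cut-offs from step 8 of section 4.3. The function names and the toy VCF lines are hypothetical and are not real genotype data.

```python
# SNP IDs and genes copied from Table 1.
TABLE1_REQTL_SNPS = {
    "exm-rs1019503": "ERAP2",
    "rs6752483": "ADCY3",
    "rs2285712": "CCDC109B",
    "rs2834160": "IFNAR2",
    "rs1477478": "IFNA21",
}


def parse_info(info_column):
    """Turn a VCF INFO column such as 'DP=12;AF=0.5' into a dict."""
    fields = {}
    for item in info_column.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style keys without '=' map to ''
    return fields


def reqtl_hits(vcf_lines, min_qual=40.0, min_depth=5):
    """Return {snp_id: gene} for Table 1 SNPs found among confident calls.

    A call is kept only if QUAL > min_qual and INFO/DP > min_depth
    (assumed thresholds, mirroring the filter step in section 4.3).
    """
    hits = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8 or columns[5] == ".":
            continue  # malformed record or missing QUAL
        qual = float(columns[5])
        depth = int(parse_info(columns[7]).get("DP", "0"))
        if qual > min_qual and depth > min_depth:
            for snp_id in columns[2].split(";"):
                if snp_id in TABLE1_REQTL_SNPS:
                    hits[snp_id] = TABLE1_REQTL_SNPS[snp_id]
    return hits


if __name__ == "__main__":
    # Toy records with made-up positions, alleles and scores.
    toy_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\trs6752483\tA\tG\t55.0\tPASS\tDP=12",
        "1\t200\trs2285712\tC\tT\t12.0\tPASS\tDP=30",
        "1\t300\trs2834160\tG\tA\t80.0\tPASS\tDP=3",
    ]
    print(reqtl_hits(toy_vcf))
```

In this sketch a candidate's set of hits only records which reQTLs were confidently called; relating them to the up-regulation of the associated genes remains a separate step.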
References
1. M. N. Lee et al., Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science. 343, 1246980 (2014).
2. P. L. De Jager et al., ImmVar project: Insights and design considerations for future studies of "healthy" immune variation. Semin. Immunol. 27, 51–57 (2015).
3. D. Altshuler, M. J. Daly, E. S. Lander, Genetic mapping in human disease. Science. 322, 881–888 (2008).
4. W. G. Feero, A. E. Guttmacher, T. A. Manolio, Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med. 363, 166–176 (2010).
5. M. A. Schaub, A. P. Boyle, A. Kundaje, S. Batzoglou, M. Snyder, Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
6. I. Gat-Viks et al., Deciphering molecular circuits from genetic variation underlying transcriptional responsiveness to stimuli. Nat. Biotechnol. 31, 342–349 (2013).
7. G. Gibson, J. E. Powell, U. M. Marigorta, Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60 (2015).
8. B. P. Fairfax, J. C. Knight, Genetics of gene expression in immunity to infection. Curr Opin Immunol. 30, 63–71 (2014).
9. M. Çalışkan, S. W. Baker, Y. Gilad, C. Ober, Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet. 11, e1005111 (2015).
10. D. Chaussabel, Assessment of immune status using blood transcriptomics and potential implications for global health. Semin. Immunol. 27, 58–66 (2015).
11. H.-J. Westra, L. Franke, From genome to function by studying eQTLs. Biochim. Biophys. Acta. 1842, 1896–1902 (2014).
12. C. Trapnell et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 7, 562–578 (2012).
13. C. Trapnell et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
14. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (2013).
15. E. Garrison, G. Marth, Haplotype-based variant detection from short-read sequencing. arXiv (2012).
16. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
Annexed Figure 1
A - Tuxedo Differential Expression RNA-Seq Pipeline flowchart, extracted from (12).
B - Genome re-sequencing assembly and variant calling workflow.

[Figure 1-B flowchart. Inputs: paired reads (Fastq files) and the genome reference hg19.fasta. Steps and the files they produce, in order:]
1) BWA-MEM, map genomic reads: BAM files
2) Picard, add read groups: RG labeled BAM files
3) Picard, merge BAM files: files merged into one BAM file
4) Samtools, filter reads: high quality BAM file
5) Picard, sort by coordinates: sorted BAM file
6) Picard, remove duplicates: dedup BAM file
7) FreeBayes, call variants: VCF file
8) VCFlib, VCF filter: high quality calls (VCF)
9) ANNOVAR, annotate variants: annotated high quality variants
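Both workflows in Annexed Figure 1 start from fastq files, so the four-line record layout and Phred+33 encoding described in section 4.1 can be sketched as below. This is a minimal illustration, not part of either pipeline: the names FastqRecord, parse_fastq, phred33_decode and error_probability are hypothetical, the parser assumes unwrapped four-line records, and the example read is made up.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str       # text after the leading '@'
    sequence: str         # nucleotide sequence of the read
    qualities: List[int]  # one Phred score per nucleotide


def phred33_decode(quality_string: str) -> List[int]:
    """Decode Sanger / Illumina 1.8+ qualities: score = ASCII code - 33."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score: float) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of fastq text lines.

    Records are read by position (groups of four lines) because a
    quality line may itself begin with '@' or '+'.
    """
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("fastq input must contain four lines per read")
    for start in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, phred33_decode(quality))


if __name__ == "__main__":
    toy_read = ["@read1 toy example", "ACGT", "+", "II5!"]
    for record in parse_fastq(toy_read):
        print(record.identifier, record.sequence, record.qualities)
        print([error_probability(q) for q in record.qualities])
```

The same Phred relation explains the variant filter threshold in workflow B: a score of 40 corresponds to an error probability of 10^(-4), that is, 1 in 10,000.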