Pathway analysis 2012



Slides on pathway analysis given to the Nataro & Lazo labs at UVA in November 2012.

Published in: Education


  1. Pathway Analysis
     Adding Functional Context to High-Throughput Results
     Stephen D. Turner, Ph.D.
     Bioinformatics Core Director
  2. Outline
     • Bioinformatics & the Bioinformatics Core
     • Service Highlight: Pathway Analysis
     • IPA demo
     December 20, 2012
  3. Bioinformatics Origins
     • Rooted in sequence analysis
     • Driven by need to:
       – Collect
       – Annotate
       – Analyze
  4. What is bioinformatics?
     (Diagram modified from @drewconway)
  5. What is bioinformatics?
     “There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”
     M. Dayhoff, February 27, 1967
  6. UVA Bioinformatics Core
     • A centralized resource for providing expert and timely bioinformatics consulting and data analysis.
     • Main goals: help you publish and get funding.
       – 1. Service
       – 2. Training
  7. This is the “stuff” we do in the bioinformatics core!
     [Diagram labels: Sample prep, Sequencing, Raw data, Differential expression, Gene identification, Novel Genes, Discoveries, …etc.]
     Find out what this “stuff” is at
  8. Services
     • Gene expression: Microarray Analysis
     • Gene expression: RNA-seq Analysis
     • Pathway analysis
     • DNA Variation (GWAS, NGS)
     • DNA Binding / ChIP-Seq
     • DNA Methylation
     • Grant / Manuscript support
     • Custom development
  9. Services
     Gene expression: Microarray Analysis
     • Accession and analysis of publicly available data (e.g. GEO, ArrayExpress).
     • Preprocessing: background subtraction, summarization, and quantile normalization using the RMA (Robust Multichip Average) expression measure described in Irizarry et al. Biostatistics 4:249-264.
     • Quality assessment:
       – Visualization of signal intensity distributions of each array using boxplots and density plots.
       – MA plots to visualize the log intensity ratio (M) against the average log intensity (A).
       – Principal components analysis to visualize the overall data (dis)similarity between arrays.
     • Analysis:
       – Estimation of fold changes and standard errors using a linear model.
       – Empirical Bayes smoothing of standard errors.
       – Lists of top differentially expressed genes, fold changes, statistical significance, multiple testing correction.
     • Visualization:
       – Heatmaps and dendrograms.
       – Volcano plots to visualize statistical significance by fold change.
     • Biological context: Pathway/Functional Analysis.
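The slide lists multiple testing correction without naming a method. As a minimal sketch, assuming the Benjamini-Hochberg false discovery rate procedure (a common choice, not necessarily the one the core uses), adjusted p-values can be computed as below. The function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-level p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)
```

Genes whose adjusted value falls below the chosen FDR threshold would then go on the list of top differentially expressed genes.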
  10. Services
      Gene expression: RNA-seq
      • Pre-alignment quality assessment:
        – Per-base sequence quality
        – Per-base sequence content
        – Per-base GC content
        – Search for overrepresented sequences (adapters, primers, etc.)
      • Alignment to a reference genome:
        – Homo sapiens
        – Mus musculus
        – Rattus norvegicus
        – Bos taurus
        – Canis familiaris
        – Gallus gallus
        – Drosophila melanogaster
        – Arabidopsis thaliana
        – Caenorhabditis elegans
        – Saccharomyces cerevisiae
      • Post-alignment quality assessment:
        – Flagging duplicate reads
        – Estimation of library complexity
        – Insert size distribution (for paired-end sequencing)
        – Analysis of coverage over transcript position
      • Transcript assembly
      • Differential expression testing
        – Isoforms
        – Genes
        – Primary transcripts
        – Coding sequence
      • Differential splicing analysis
      • Differential coding output
      • Differential promoter use
      • Visualization: assistance with visualization using IGV.
  11. Services
      DNA Variation: Genotyping
      • Study design & power calculations for SNP genotype-phenotype association studies
      • Data management and quality control
      • PCA for population stratification control
      • Imputation to a reference population (e.g. HapMap, 1000 Genomes)
      • Analysis, interpretation, visualization
      • Manuscript preparation
      • Grant support (compliance with NIH data sharing policies, methodology for data management, design, analysis, and interpretation)
      • Acquisition of publicly available data (dbGaP)
      DNA Variation: Next-Gen Sequencing
      • Alignment to a reference genome
      • Calibration of quality scores and duplicate read removal
      • Variant calling
      • Variant annotation
      • SNP effect prediction
      • De novo assembly
      • Any of the applicable analysis, interpretation, and visualization services described above for genotyping data.
  12. Service Highlight: “Pathway Analysis”
      • You’ve done your microarray/RNA-Seq experiment
        – You have a list of genes
        – Want to put these into functional context
        – What biological processes are perturbed?
        – What pathways are being dysregulated?
        – Data reduction: hundreds or thousands of genes can be reduced to tens of pathways
        – Identifying active pathways = more explanatory power
      • “Pathway analysis” encompasses many, many techniques.
        1. 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
        2. 2nd Generation: Functional Class Scoring (e.g. GSEA)
        3. 3rd Generation (in development): Pathway Topology (e.g. SPIA)
  13. Over-representation analysis (ORA)
      • Many variations on the same theme: statistically evaluates the fraction of genes in a particular pathway that show changes in expression.
      • Algorithm:
        1. Create input list (e.g. “significant at p<0.05”)
        2. For each gene set:
           a. Count the number of input genes in the set
           b. Count the number of “background” genes in the set (e.g. all genes on the platform).
        3. Test each pathway for over-representation of input genes
      • Gene Set: typically a gene ontology (GO) term.
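The slide does not name the test used in step 3. One common choice, assumed here, is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The symbols are notation introduced for this sketch: N is the number of background genes, K the number of background genes in the gene set, n the number of input genes, and k the number of input genes in the gene set.

```latex
P(X \ge k) \;=\; \sum_{i=k}^{\min(n,K)}
  \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

A small value means that seeing k or more input genes in the set would be unlikely if the input list were a random draw from the background.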
  14. Gene Ontology
      • Ontology = formal representation of a knowledge domain.
      • Gene ontology = cell biology.
      • GO represented by directed acyclic graph (DAG).
        – Terms are nodes, relationships are edges.
        – Parent terms are more general than their child terms.
        – Unlike a simple tree, terms can have multiple parents.
      Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15. doi:10.1038/nrg2363
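To illustrate why a DAG differs from a tree, the sketch below walks from one term up to all of its ancestors. The mini-graph is hypothetical: the term names are illustrative and the edges are not copied from a GO release.

```python
# Hypothetical mini-DAG: each term maps to the set of its parent terms.
parents = {
    "biological_process": set(),
    "metabolic process": {"biological_process"},
    "biosynthetic process": {"metabolic process"},
    "nucleotide metabolic process": {"metabolic process"},
    # A term with two parents, which a simple tree could not represent.
    "nucleotide biosynthetic process": {
        "biosynthetic process",
        "nucleotide metabolic process",
    },
}


def ancestors(term, parents):
    """Return every term reachable by following parent edges from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


result = ancestors("nucleotide biosynthetic process", parents)
```

Because a gene annotated to a child term is implicitly annotated to every ancestor as well, gene sets built from parent and child terms overlap heavily.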
  15. GO ORA: Example
      • Algorithm:
        1. Create input list (e.g. “significant at p<0.05”)
        2. For each gene set:
           a. Count the number of input genes in the set
           b. Count the number of “background” genes in the set (e.g. all genes on the platform).
        3. Test each pathway for over-representation of input genes
      • Ex: GO “Purine Ribonucleotide Biosynthetic Process”
        – 1% of input (significant) genes are annotated with this term.
        – 1% of genes on the chip are annotated with this term.
        – Not significantly overrepresented.
      • Ex: GO “V(D)J Recombination”
        – 20% of input (significant) genes are annotated with this term.
        – 1% of genes on the chip are annotated with this term.
        – Highly significantly over-represented!
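The two examples can be reproduced with a short sketch. It assumes a one-sided hypergeometric test, since the slides do not name a test, and the chip size (10,000 genes), input list size (100 genes), and function names are all hypothetical values chosen to match the percentages on the slide.

```python
from math import comb


def ora_pvalue(n_background, n_set, n_input, n_overlap):
    """One-sided hypergeometric tail P(X >= n_overlap).

    n_background: genes on the platform
    n_set:        background genes annotated to the gene set
    n_input:      genes on the input (significant) list
    n_overlap:    input genes annotated to the gene set
    """
    upper = min(n_input, n_set)
    favorable = sum(
        comb(n_set, i) * comb(n_background - n_set, n_input - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_background, n_input)


def ora(input_genes, background_genes, gene_sets):
    """Steps 2 and 3 of the algorithm: count per gene set, then test."""
    background = set(background_genes)
    inputs = set(input_genes) & background
    results = {}
    for name, members in gene_sets.items():
        on_platform = set(members) & background   # step 2b
        overlap = inputs & on_platform            # step 2a
        results[name] = ora_pvalue(
            len(background), len(on_platform), len(inputs), len(overlap)
        )
    return results


# Hypothetical numbers: 10,000 genes on the chip, 100 significant genes,
# and 100 chip genes (1%) annotated with each term.
p_purine = ora_pvalue(10_000, 100, 100, 1)   # 1% of input genes in the term
p_vdj = ora_pvalue(10_000, 100, 100, 20)     # 20% of input genes in the term
```

Under these assumed counts the first p-value is large and the second is vanishingly small, matching the slide's verdicts. In practice the p-values would also need multiple testing correction across all gene sets tested.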
  16. GO ORA: Example
  17. GO ORA: Limitations
      • Some categories are so general they’re meaningless (e.g. “cellular process”).
      • ORA uses genes above a cutoff and discards everything else.
      • ORA only uses the number of genes, and ignores their measured changes.
      • Two assumptions violated:
        – Genes are independent (NOT! Coexpression, interaction, etc.).
        – Pathways are independent (violated by definition by the DAG).
  18. Functional Class Scoring
      • Theory: while large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes can also have significant effects.
      • General Algorithm:
        1. Compute gene-level statistic (e.g. fold change, Student’s t).
        2. Aggregate gene-level statistics for all genes in pathway into a single pathway-level statistic.
        3. Assess significance with permutation.
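The general algorithm can be sketched in a few lines. This is one possible instance under stated assumptions, not a specific published method: the pathway-level statistic is taken to be the mean of the gene-level statistics, and significance is assessed by permuting gene labels (drawing random gene sets of the same size). The function name and the toy data are hypothetical.

```python
import random
from statistics import mean


def fcs_pvalue(gene_stats, gene_set, n_perm=1000, seed=0):
    """Mean-statistic functional class score with a gene-label permutation test.

    gene_stats: dict mapping gene -> gene-level statistic (step 1, precomputed)
    gene_set:   set of genes in the pathway
    """
    rng = random.Random(seed)
    genes = sorted(gene_stats)
    members = [g for g in genes if g in gene_set]
    # Step 2: aggregate gene-level statistics into one pathway-level statistic.
    observed = mean(gene_stats[g] for g in members)
    # Step 3: compare against random gene sets of the same size.
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if abs(mean(gene_stats[g] for g in sample)) >= abs(observed):
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 50 genes with small statistics, five of which are shifted upward.
gene_stats = {f"g{i}": (0.1 if i % 2 else -0.1) for i in range(50)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
for g in pathway:
    gene_stats[g] = 2.0

observed, p = fcs_pvalue(gene_stats, pathway)
```

Permuting phenotype labels instead of gene labels tests a different null hypothesis and requires the raw expression matrix, so it is not shown here.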
  19. Gene Set Enrichment Analysis
      1. Calculate an Enrichment Score
         a) Rank genes by their expression difference
         b) For each Gene Set*:
            i. Compute cumulative sum over ranked genes
               1. Increase sum when gene is in set, decrease otherwise
               2. Magnitude of increment depends on gene-phenotype correlation
            ii. Record the maximum deviation from zero as Enrichment Score (ES)
      2. Assess significance
         a) Permute phenotype (or gene labels) 1000 times
         b) Compute ES for each permutation (empirical null).
         c) Compare ES for actual data to distribution of ES from permuted data.
         d) Normalize ES by accounting for gene set size, giving the normalized enrichment score (NES)
         e) Control multiple testing by calculating FDR for each NES
      • * Gene sets: Come from MSigDB
        – http://
        – MSigDB is a collection of annotated gene sets for use with GSEA software.
        – Positional, curated, computationally predicted, GO.
        – Curated: KEGG, Reactome, STKE, etc.
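Step 1 can be sketched as a weighted running sum. This is a simplified illustration, not the GSEA software itself: it covers only the enrichment score, the function and parameter names are hypothetical, and it assumes the gene set has at least one member and one non-member in the ranked list.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked_genes: genes sorted by descending score
    scores:       dict mapping gene -> gene-phenotype correlation
    gene_set:     set of genes to test
    weight:       exponent applied to |score| for in-set genes
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(
        abs(scores[g]) ** weight for g, hit in zip(ranked_genes, in_set) if hit
    )
    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            # Increment scaled by the gene's correlation with the phenotype.
            running += abs(scores[g]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        # Keep the running-sum value farthest from zero, sign included.
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: six genes, already sorted by descending score.
scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": -1.0, "e": -2.0, "f": -3.0}
ranked = ["a", "b", "c", "d", "e", "f"]

es_top = enrichment_score(ranked, scores, {"a", "b"})
es_bottom = enrichment_score(ranked, scores, {"e", "f"})
```

A set concentrated at the top of the ranking gets a positive score and one concentrated at the bottom gets a negative score. Step 2 would repeat this calculation on permuted data to build the null distribution.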
  20. GSEA: Example
  21. FCS/GSEA: Limitations
      • Violate same assumptions as GO-ORA:
        – Genes are independent
        – Pathways are independent
      • Only consider number/magnitude of genes, and ignore other information in databases:
        – Directionality of the interaction
        – Nature of the interaction (activation, inhibition, etc.).
        – Where the interaction occurs (nucleus, cytoplasm, etc.).
  22. Pathway Topology: SPIA
      • Utilizes directionality, function, and topology.
      • Computes two orthogonal p-values:
        – pNDE: Number of Differentially Expressed genes (like ORA).
        – pPERT: degree of perturbation
      • pG is overall p-value (pNDE and pPERT combined)
      • pGFDR is overall FDR-corrected p-value
  23. Pathway Topology: SPIA
      • TCR Signaling Pathway Results
        – pNDE: 6.5e-9
        – pPERT: 0.29
        – pGFDR: 1.2e-6
        – Conclusion: many differentially expressed genes, but pathway may not be badly perturbed.
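The slides say pG combines pNDE and pPERT without giving the rule. As we understand the SPIA publication (Tarca et al., 2009), the combination is pG = c - c·ln(c) with c = pNDE × pPERT. The sketch below applies that assumed formula to the TCR signaling values; the function name is hypothetical.

```python
from math import log


def spia_combined_p(p_nde, p_pert):
    """Combine two independent p-values as c - c*ln(c), where c is their product."""
    c = p_nde * p_pert
    if c == 0.0:
        return 0.0  # limit of c - c*ln(c) as c approaches zero
    return c - c * log(c)


# Values from the TCR signaling pathway slide.
p_g = spia_combined_p(6.5e-9, 0.29)
```

The result is the pathway's unadjusted pG. The pGFDR value on the slide cannot be reproduced from one pathway alone, because FDR correction depends on the p-values of every pathway tested.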
  24. Pathway Topology / SPIA: Limitations
      • With SPIA, still need an arbitrary “cutoff”, e.g. top 500, or p<0.05, etc.
      • True topology is dependent on type of cell due to cell-specific gene expression profiles.
      • Tissue-specific topology is rarely available and fragmented in databases, even if it’s fully understood.
      • Other general limitations of pathway analysis also apply.
  25. Pathway Analysis: General Limitations
      • Low resolution knowledge bases
        – E.g. RNA-seq studies have found >90% of the transcriptome is alternatively spliced.
        – Different transcripts can have different or opposing functions.
      • Incomplete/inaccurate annotations.
      • Oct 2007: 95% of GO annotations inferred electronically (i.e. not manually curated).
      • Missing condition- and cell-specific information.
      • Methodological challenge: lack of benchmarks.
  26. Pathway Analysis: Conclusions
      • Pathway analysis gives you more biological insight than staring at lists of genes.
      • Pathway analysis is complex, and has many limitations.
      • Pathway analysis is still more of an exploratory procedure than a pure statistical endpoint.
      • The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.
  27. IPA Demo
      • Background: Microarray data from Childhood Exacerbated Asthma compared to normal state.
      • Question: Do the data support involvement of immune/inflammatory responses and viral infection in the acute asthma attack?
      • Tasks:
        – View Canonical pathways that contain significant numbers of genes from this dataset.
        – Overlay a Function/Disease state that shows how key signaling pathways for fighting off respiratory infections overlap with asthmatic inflammation.
        – Overlay Biomarkers that identify genes in the infection signaling pathway that are also used for diagnosis and as efficacy indicators for asthma treatments.
        – Search the Ingenuity Knowledge Base for literature references that support your findings.
        – Investigate a “weird” finding…
  28. Thank you
      Web:
      E-mail:
      Blog: www.Ge{
      Twitter: