All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes

Talk by Will Trimble of Argonne National Laboratory, on April 23, 2014, at MSU's BEACON Center for the Study of Evolution in Action on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.

Published in: Science, Business, Technology
Transcript of "All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes"

  1. All kmers are not created equal: finding the signal from the noise in large-scale metagenomes. Will Trimble, metagenomic annotation group, Argonne National Laboratory. BEACON seminar, April 23, 2014, MSU.
  2. Apology: I speak biology with an accent
     • I spent six years in dark rooms with lasers.
     • Now I use computers to analyze high-throughput sequence data.
     • I introduce myself as an applied mathematician.
     • Finding scoring functions to answer questions with ambiguous data.
  3. Apology: I speak biology with an accent
     • I spent six years in dark rooms with lasers.
     • Now I use computers to analyze high-throughput sequence data.
     • I introduce myself as an applied mathematician.
     • Finding scoring functions to answer questions with ambiguous data.
     • Shoveling data from the data-producing machine into the data-consuming furnace.
  4. Outline
     • Sequences are different
     • How much did my sequencing run give me? kmerspectrumanalyzer
     • How much did I sample? nonpareil-k
     • Pretty pictures: thumbnailpolish
  5. Outline
     • Sequences are different (math)
     • How much did my sequencing run give me? kmerspectrumanalyzer (graphs)
     • How much did I sample? nonpareil-k (graphs)
     • Pretty pictures: thumbnailpolish (micrographs)
  6. Sequences are different
     • Sequencing produces sequences. Sequences are qualitatively different from all other data types.
       @HWI-ST1035:125:D1K4CACXX:8:1101:1168
       CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
       +
       @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
       @HWI-ST1035:125:D1K4CACXX:8:1101:1190
       CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
       +
       CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
       @HWI-ST1035:125:D1K4CACXX:8:1101:1339
       CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
       +
       BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ
     Instrument readings, spectra, micrographs: not categorical.
     Low-throughput categorical data: categories are sound.
     High-throughput sequence data: categorization is an art.
  7. Sequences are different
     • Sequencing produces sequences. Sequences are qualitatively different from all other data types.
       @HWI-ST1035:125:D1K4CACXX:8:1101:1168
       CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
       +
       @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
       @HWI-ST1035:125:D1K4CACXX:8:1101:1190
       CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
       +
       CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
       @HWI-ST1035:125:D1K4CACXX:8:1101:1339
       CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
       +
       BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ
     Instrument readings, spectra, micrographs: not categorical.
     Low-throughput categorical data: categories are sound.
     High-throughput sequence data: categorization is an art.
     10^7 channels / 10^3 channels / 10^11 channels
  8. Sequences are different
     • Sequencing produces sequences. Sequences are qualitatively different from all other data types.
     • Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
  9. Searching
     What is this sequence?
       >mystery_sequence
       CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
     Who wrote this line?
       “be regarded as unproved until it has been checked against more exact results”
     We know what to do with these puzzles. You go to this website, and type it in…
  10. Searching
      What is this sequence?
        >mystery_sequence
        CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
      Who wrote this line?
        “be regarded as unproved until it has been checked against more exact results”
      How long do reads need to be to recognize them?
  11. Searching
      What is this sequence?
        >mystery_sequence
        CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
      Who wrote this line?
        “be regarded as unproved until it has been checked against more exact results”
      How long do reads need to be to recognize them?
      To do what, to place on a reference genome? This can be turned into a math problem that I will illustrate with a search engine analogy.
  12. How long do reads need to be?
      Information (Shannon, 1949, BSTJ):
      $$H = \sum_i p_i \log_2\!\left(\frac{1}{p_i}\right)$$
      is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data.
      Profound applicability in pattern matching + modeling.
      Logarithmic measurements have units!
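A minimal sketch of the entropy formula on this slide, evaluated for a model of single bases. The function name and the skewed base composition are illustrative assumptions, not values from the talk.

```python
import math

def shannon_entropy_bits(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i), in bits."""
    # Terms with p = 0 contribute nothing and are skipped.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# Uniform model over the four bases A, C, G, T: the "model of ignorance".
uniform = [0.25, 0.25, 0.25, 0.25]

# Hypothetical skewed (AT-rich) composition, chosen only for illustration.
skewed = [0.4, 0.4, 0.1, 0.1]

print(shannon_entropy_bits(uniform))  # 2 bits per base for the uniform model
print(shannon_entropy_bits(skewed))   # a less flat model has fewer bits per base
```

Using base-2 logarithms is what gives the result its unit of bits.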
  13. A word on the sign of the entropy
      • A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences.
      • When we learn something (for example, we collect some genomes and expect our new sequences to look like them) we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.
  14. How long do phrases need to be?
      Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.
        for n in 1..10
            type the first n words into google books, quoted.
            break if google identifies your book.
  15. How long do phrases need to be?
      • Information content of English words: H_word is ca. 12 bits per word.
      • Size of google books? Big libraries have a few 10^7 books, each one has 10^5 indexed words, so a database size of 10^12 words.
        log2(database size) = log2(10^12) = log2(2^39.9) = 39.9, or about 40 bits.
      • So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index. Try it.
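The arithmetic on this slide can be checked in a few lines. This is only a sketch of the slide's back-of-envelope estimate; the variable names are illustrative, and the inputs are the slide's round numbers, not measurements.

```python
import math

bits_per_word = 12        # slide's estimate of the information per English word
books = 1e7               # "a few 10^7 books" in a big library
words_per_book = 1e5      # indexed words per book
database_words = books * words_per_book   # about 10^12 words

# Bits needed to single out one position in the database.
bits_needed = math.log2(database_words)

# Whole words needed to supply at least that many bits.
words_needed = math.ceil(bits_needed / bits_per_word)

print(f"{bits_needed:.1f} bits, {words_needed} words")
```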
  16. How long do phrases need to be?
      Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.
        for n in 1..10
            type the first n words into google books, quoted.
            break if google identifies your book.
      Most often takes 4 words.
  17. How long do phrases need to be?
      • Information content of English words: H_word is ca. 12 bits per word.
      • Size of google books? Big libraries have a few 10^7 books, each one has 10^5 indexed words, so a database size of 10^12 words.
        log2(database size) = log2(10^12) = log2(2^39.9) = 39.9, or about 40 bits.
      • So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index. Try it.
      Not all phrases are equally distinctive.
  18. How long do reads need to be?
      • Maximum information content of ℓ base pairs: H_read = 2ℓ bits per length-ℓ sequence.
      • Most long kmers are distinct: for a genome of size G (ca. 10^10 bp),
        log2(G) = log2(10^10) = log2(2^33.2) = 33.2, rounded up to 34 bits.
      • So we expect that when 2ℓ ≥ 34 bits, we should be able to place any sequence.
      • That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
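The same estimate for reads, as a small sketch. It assumes the slide's uniform model of 2 bits per base; `min_kmer_length` is a hypothetical helper name, and the E. coli genome size of roughly 4.6 Mbp is an approximate figure added for comparison.

```python
import math

def min_kmer_length(genome_size_bp, bits_per_base=2.0):
    """Smallest length whose maximum information content covers log2(G) bits."""
    bits_needed = math.log2(genome_size_bp)
    return math.ceil(bits_needed / bits_per_base)

# Slide's example: G of about 10^10 bp.
print(min_kmer_length(1e10))

# A bacterial genome of roughly 4.6 Mbp (approximate size of E. coli).
print(min_kmer_length(4.6e6))
```

This is a lower bound under the random sequence model; real genomes contain repeats, so longer sequences are needed in practice to place a read unambiguously.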
  19. The data deluge
      • There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
      • The data has outgrown some favorite algorithms from the 1990s (BLAST).
  20. Picture, if you will, a hiseq flowcell
      • Pairs of microbial genomes
      • Microbial transcriptomes + replicates
      • Environmental isolate genomes
      • Environmental extract sequencing
      • Preparation-intensive sequencing
      • Eukaryotic sequencing
      • Eukaryotic sequencing for variants
      What’s in there?
  21. Picture, if you will, a hiseq flowcell
      • Pairs of microbial genomes
      • Microbial transcriptomes + replicates
      • Environmental isolate genomes
      • Environmental extract sequencing
      • Preparation-intensive sequencing
      • Eukaryotic sequencing
      • Eukaryotic sequencing for variants
      What’s in there? Let’s count kmers!
  22. The kmer spectrum.
      [Figure: kmer spectrum of a microbial genome. x-axis: 21mer abundance; y-axis: number of kmers.]
  23. The kmer spectrum.
      [Figure: kmer spectrum of a microbial genome. x-axis: 21mer abundance, from rare to abundant; y-axis: number of kmers.]
      • Low-abundance kmers: errors.
      • Peak contains most of genome.
      • High-abundance peak contains multicopy genes.
      • Really high abundance stuff: often artifacts.
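A minimal sketch of what "counting kmers" and "the kmer spectrum" mean computationally. This is a toy in-memory illustration and not the kmerspectrumanalyzer implementation: the function names and reads are made up, and it does not merge a kmer with its reverse complement, as kmer counters for sequencing data typically do.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(kmer_counts):
    """Map abundance -> number of distinct kmers seen that many times."""
    return Counter(kmer_counts.values())

# Toy reads, for illustration only.
reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTTTTTTT"]
spectrum = kmer_spectrum(count_kmers(reads, k=4))

for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

The spectrum is far smaller than the table of counts, which is why the summaries later in the talk can work from it alone.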
  24. Ranked kmer spectrum
      [Figure: ranked kmer spectrum. x-axis: kmer rank (cumulative sum of number of kmers); y-axis: 21mer abundance, from rare to abundant.]
  25. Ranked kmers consumed
      [Figure: ranked kmers consumed. x-axis: fraction of observed kmers; y-axis: 21mer abundance, from rare to abundant.]
      Data fraction is unusually stable.
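The two ranked plots can be derived from the kmer spectrum alone. The sketch below assumes the spectrum is a mapping from abundance to number of distinct kmers; the function name and the toy spectrum are illustrative assumptions, not taken from the talk's software.

```python
def ranked_spectrum(spectrum):
    """Return rows (abundance, cumulative_rank, cumulative_data_fraction),
    ordered from the most abundant kmers to the rarest."""
    total_observations = sum(r * n for r, n in spectrum.items())
    rows = []
    rank = 0       # cumulative number of distinct kmers
    observed = 0   # cumulative number of kmer observations
    for abundance in sorted(spectrum, reverse=True):
        n = spectrum[abundance]
        rank += n
        observed += abundance * n
        rows.append((abundance, rank, observed / total_observations))
    return rows

# Toy spectrum: 1000 singleton kmers (likely errors), a genomic peak
# at abundance 30, and a few multicopy kmers at abundance 60.
toy_spectrum = {1: 1000, 30: 500, 60: 20}

for abundance, rank, fraction in ranked_spectrum(toy_spectrum):
    print(abundance, rank, round(fraction, 3))
```

In this toy example the singletons are most of the distinct kmers but only a small fraction of the observations, which is one way to see why data fractions are less sensitive to errors than counts of distinct kmers.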
  26. Different kinds of data have different spectra
  27. Different kinds of data have different spectra
  28. Redundancy is good
      • OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
      • OMG! I see this sequence 10 million times.
      • OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
      • Error correction and diginorm somewhat amusingly strive for opposite ends.
  29. Redundancy is good
      • OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
      • OMG! I see this sequence 10 million times.
      • OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
      • Error correction and diginorm somewhat amusingly strive for opposite ends.
      Abundance-based inferences are better in the high-abundance part of the data.
  30. kmerspectrumanalyzer: infer genome size and depth
      $$P_{NO}(x;\, c, \{a_n\}, s) = \sum_n a_n \,\mathrm{NB}_{pdf}(s;\ \mu = cn,\ \alpha = s/n)$$
      Generalization of mixed-Poisson model to estimate how much sequence is in each peak.
  31. Counting kmers tells you genome size
      [Figure (Fig 2): estimated genome size (kb) versus complete genome size (kb), both axes from 0 to 10000.]
      …for single genomes, most of the time.
      So much for calibration data.
  32. The kink does measure error
      [Figure: artificial E. coli data with varying substitution errors: 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, 0.1%.]
  33. But I want to sequence everything!
      Ok, we can count kmers in everything too.
      kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth.
  34. How much novelty is in my dataset?
      How many sequences do you need to see before you start seeing the same ones over and over again?
      Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
  35. How much novelty is in my dataset?
      $$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 0)}$$
      How many sequences do you need to see before you start seeing the same ones over and over again?
      Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
      We can calculate this efficiently using the kmer spectrum.
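A sketch of how the nonunique fraction could be evaluated from a kmer spectrum. It assumes one reading of the slide's formula: the minus signs lost in extraction are restored as 1 − Poisscdf(·), and the two Poisson terms form a ratio (seen at least twice, given seen at least once, at subsampling fraction ε). Here r_i is an abundance and n_i the number of distinct kmers with that abundance. The function names are illustrative, and this is not the Nonpareil-k code.

```python
import math

def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k + 1))

def nonunique_fraction(epsilon, spectrum):
    """Data-weighted fraction of kmers expected to be seen more than once
    when only a fraction epsilon of the data is sampled.

    spectrum maps abundance r_i -> number of distinct kmers n_i.
    """
    total = sum(n * r for r, n in spectrum.items())
    result = 0.0
    for r, n in spectrum.items():
        lam = epsilon * r                               # expected abundance after subsampling
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0:
            result += (n * r / total) * seen_at_least_twice / seen_at_least_once
    return result

# Toy spectrum: every one of 100 distinct kmers observed 10 times.
toy = {10: 100}
for eps in (0.01, 0.1, 1.0):
    print(eps, nonunique_fraction(eps, toy))
```

Sweeping ε from small values up to 1 traces out a rarefaction-style curve without re-reading the sequence data, which is the efficiency the slide refers to.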
  36. Nonpareil: model of sequence coverage. Nonpareil-k: kmer rarefaction summary of sequence diversity.
      • Nonpareil: uses subset-against-all alignment to find out how much of dataset is unique.
      • Nonpareil-k: crunches kmer spectrum to approximate the unique fraction, 300x faster.
  37. Nonpareil: model of sequence coverage. Nonpareil-k: kmer rarefaction summary of sequence diversity.
  38. Nonpareil-k: stratify datasets by coverage distribution
      • Most of dataset likely contained in assembly.
      • Assembly is likely to miss or attenuate the large unique fraction of dataset.
  39. kmer spectra reveal sequencing problems
      • Amok PCR: seemingly random sequences
      • Amok MDA: 10 Gbases of sequence, one gene
      • PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
      • Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
      • Contaminated samples: 95% E. coli, 5% E. faecalis
  40. Hey kid, you want some unlabeled data?
      [Figure: HMP / quantile norm / euclidean / colored by alpha.]
      MG-RAST API, R-package matR.
  41. Hey kid, you want some pretty ordinations?
      [Figure 2a]
  42. Generalities from the kmer counting mines
      • Many datasets have as much as 5-45% of the sequence yield in adapters.
      • FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
      • Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
      • Shannon entropy is oversensitive to errors. Higher-order Rényi entropy is more stable.
  43. kmer statistical summaries
      • H0 kmer richness (VERY BAD)
      • H1 Shannon entropy (BAD)
      • H2 Rényi entropy / Simpson index (GOOD)
      • observation-weighted coverage (BAD)
      • observation-weighted size (BAD)
      • observation-median coverage (GOOD)
      • observation-median size (GOOD)
      • fraction in top 100 kmers (USEFUL)
      • fraction unique (OK but requires size correction)
  44. kmer statistical summaries
      • H0 kmer richness (VERY BAD)
      • H1 Shannon entropy (BAD)
      • H2 Rényi entropy / Simpson index (GOOD)
      • observation-weighted coverage (BAD)
      • observation-weighted size (BAD)
      • observation-median coverage (GOOD)
      • observation-median size (GOOD)
      • fraction in top 100 kmers (USEFUL)
      • fraction unique (OK but requires size correction)
      Most of these give answers which vary so strongly with sampling depth as to be unusable.
      Observation-weighted fraction-of-data metrics behave fairly well.
      Fractions of the data with particular properties are stable with respect to sampling.
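To illustrate why the higher-order entropies are more stable, here is a sketch that computes the Rényi entropies H0, H1 and H2 from a kmer spectrum, using the standard definition H_a = log2(sum_i p_i^a) / (1 − a), with H1 taken as the Shannon limit. The function name and the toy spectra are assumptions made for illustration; they are not data from the talk.

```python
import math

def renyi_entropy_bits(spectrum, order):
    """Rényi entropy of the kmer distribution, in bits.

    spectrum maps abundance r -> number of distinct kmers n;
    each such kmer has probability r / total.
    """
    total = sum(r * n for r, n in spectrum.items())
    if order == 1:
        # Shannon entropy, the limit of the Rényi entropy as order -> 1.
        return sum(n * (r / total) * math.log2(total / r) for r, n in spectrum.items())
    power_sum = sum(n * (r / total) ** order for r, n in spectrum.items())
    return math.log2(power_sum) / (1 - order)

# Toy "genome": 8 distinct kmers, each seen 50 times.
clean = {50: 8}
# The same data plus 40 singleton kmers standing in for sequencing errors.
noisy = {50: 8, 1: 40}

for order in (0, 1, 2):
    shift = renyi_entropy_bits(noisy, order) - renyi_entropy_bits(clean, order)
    print(f"H{order} shifts by {shift:.2f} bits when errors are added")
```

In this toy case the rare error kmers move H0 the most and H2 the least, in line with the ratings on the slide.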
  45. thumbnailpolish: http://www.mcs.anl.gov/~trimble/flowcell/
  46. Sometimes the sequencer has a bad day.
  47. Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens.
      Formerly of Yale: Howard Ochman, David Williams.
      Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas.
  48. Observation: Most scientists seem to be self-taught in computing.
      Observation: Most scientists waste a lot of time using computers inefficiently.
      Adina and I volunteer with
  49. We teach scientists how to get more done: Woods Hole, Tufts, U. Chicago, U. Chicago, UIC.