2. whoami? -> Titus Brown
• Bioinformatics method development, evaluaton, and
application.
• Strangely passionate about environmental
metagenomics…
• Cross-cutting interests:
• Good (efficient, accurate, remixable) software
development, on the open source model.
• Reproducibility.
• Open science; preprints, blogging, Twitter.
@ctitusbrown on Twitter
http://ivory.idyll.org/blog/
3. The Plan
•Saturday afternoon: Intro to shotgun sequencing
and metagenome assembly
•Saturday evening: lobster feast!
•Sunday: break!
•Monday morning: binning genomes!
•Monday afternoon: more assembly!
•Monday evening: workflows (and werewolf ?)
•Tuesday morning: k-mers are awesome!
•Tuesday afternoon: Functional analysis!
•Tuesday evening: Capstone!
5. Shotgun sequencing & de novo assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was
the age of foolishness
6. Shotgun sequencing analogy:
feeding books into a paper shredder,
digitizing the shreds, and reconstructing
the book.
Although for books, we often know the language and not just the alphabet
8. Shotgun sequencing
“Coverage” is the average number of reads that overlap
each true base in (meta)genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
11. Goals of shotgun metagenomics
• Expand beyond taxonomic/community structure characterization
possible with 16s;
• Analyze virus, plasmid, strain-level content;
• Evaluate metabolic capacity (e.g. “is nirK present?”)
• Reconstruct genomes from metagenomes, if possible.
12. 16s? Or Shotgun metagenomics?
•Cons vs amplicon:
•lower coverage / more expensive (good? bad?
what are the tradeoffs?)
•much more computationally challenging to
analyze
•Pros vs amplicon:
•different bias (no primers)
•virus/phage can be detected
•function can be more directly detected
•recover (putative) genomes
13. Shotgun metagenome assembly: reconstruct
original genome by finding overlaps in data
UMD assembly primer (cbcb.umd.edu)
Randomly sequencing DNA, then finding overlaps and inferring true sequence:
14. Shotgun sequencing & de novo assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was
the age of foolishness
15. Note: Shotgun metagenome data is always
incomplete.
Smaller circles represent (some of)
your actual community.
Blue circle represents what’s in your
sequencing data set.
Shotgun metagenome data may not
contain everything in your
community; may contain strain
variants; may contain “unknown”
microbes.
16. What can we do with shotgun
metagenomics?
• do taxonomic analysis directly on the reads - (Tuesday!)
• search the reads for genes of interest (function, taxonomy)
• assemble the reads into contigs, longer stretches of DNA - (next few
days)
• annotate contigs with taxonomy, function
• (note distinction between de novo assembly and reference-based assembly)
• cluster contigs together to extract genome bins, aka “metagenome
assembled genomes” - (Monday!)
• compare contig abundances across samples to look for differentially
abundant sequences
• (see Mike Lee’s diagram of All the Tools!)
17. Important notes on assembly:
•assembly squashes abundance 8:
•assembly ignores complicated regions :(
•assembly is surprisingly accurate and (when
using megahit, at least) computationally
tractable :)
18. Important notes on quantifying assembled
contigs:
• it’s not clear what tools to use for differential
abundance analysis
• we have a tutorial for quantifying contig
abundance with salmon - see
also notebook
• mapping abundances look wavy
19. Some open computational research
questions:
•right now assembly based approaches simply
…discard some proportion of the data. we
should figure out a better way.
•what is the value of long reads in shotgun
metagenomics?
20.
21. Assembly results
% megahit -1 R1.fq.gz -2 R2.fq.gz
…
--- [STAT] 7774 contigs, total 4987609 bp, min 200 bp, max 8658 bp,
avg 642 bp, N50 1049 bp
% head final.contigs.fa
>k141_3 flag=0 multi=20.8090 len=230
TGATCCTGTAGTGACTAACGGACGGATGTAAGCATCTTTTAGACCATTACGTT
CTAACAATTCGTAAGTAATTTGAGTTAGTTGTTCTTCAGAATAATTCAATTTAAT
ATTCATGACTTCTGCTCCATACTTAAGCCTTTTGTAATGCTCATATGACTTAAAA
ATTTTAGTTTCTTCGTTGGTATTGTATGCTCTTATACCTTCGAAAACCCCATTTC
CATAGTGTAAA