Christopher W. Beitel, Aaron Darling
Genomics revolutionized our understanding of bacteria
Perna et al 2001 Nature, Welch et al 2002 PNAS
Three genomes, same species only 40% genes in common
Phylogenetic diversity: the known unknowns
Can amplify and sequence the 16S rRNA gene present in all bacteria and
Millions of 16S sequences obtained per run
Align sequences, build evolutionary tree, analyze
Rinke et al 2013, Nature
Uncultured bacterial diversity greatly exceeds known isolate diversity
Is metagenomics the answer...??
Smash with bead beater
Shear DNA to 400nt frags
Metagenomics loses long-range linkage
At least two paths forward:
1. Physically “dissect” microbial
2. Better inference methods
Prep sequencer library
3. Nanopore long read sequencing?
What is Hi-C?
High throughput sequencing of proximity-ligation products
Chromosome conformation capture: sdfdf
Lieberman-Aiden et al 2009
Single (human) cell Hi-C:
Nagano et al 2013
We propose to use Hi-C for metagenomics
Conjecture: DNA in the same cell will become crosslinked, ligated, and sequenced
more frequently (green links) than DNA fragments from different cells (red links)
A little experiment to test Hi-C
Selected 5 strains, 4 species:
+ E. coli BL21 (BL21)
+ E. coli K12 (K12)
+ Pediococcus pentosaceus (Ped)
+ Burkholderia thailandensis (2 chromosomes: Bur1, Bur2)
+ Lactobacillus brevis (chromosome + 2 plasmids: Lac0, Lac1, Lac2)
Grew up cultures, mixed in equal parts based on OD600
Applied standard Hi-C protocol, sequenced on Illumina MiSeq
~22M read pairs, 160nt reads
Reconstructing genomes from within metagenomes
A classic problem in metagenomics
Can't de novo assemble Hi-C data due to presence of ligation
junctions, other issues
To test reconstruction we instead simulated reads with
BioGrinder and assembled with soapdenovo.
A Hi-C scaffold graph
Nodes are assembly contigs
node size ∝ contig size
edge weights ∝ normalized
Hi-C read counts linking contigs
Fruchterman Reingold layout
Nodes colored by SPECIES
Can we actually compute clusters?
Markov clustering, I=1.1:
4 clusters, >97% of genome
in each bin
Contig clustering: why it works
Rate at which Hi-C reads associate
within and across species
Hi− read pair insert distributions
L. brevis ATCC367
P. pentosaceus ATCC25745
E. coli K12 DH10B
E. coli BL21(DE3)
B. thailandensis E264 chrI
B. thailandensis E264 chrII
Insert distance in nucleotides
99% of Hi-C read pairs associate within-species
Hi-C read pairs associate throughout the chromosome
Spans distances that no existing or imaginary “long read” technology can cover
Resolving strain differences: Single nucleotide variants
- SNPs between E. coli K12 and BL21 are nodes
- Edges link SNPs observed in same read pair
Are Hi-C graphs are better connected than mate-pair??
10k MP 20k MP Hi-C
Resolving strain differences: Hi-C versus mate-pair
Distance on genome versus path length in variant graph
Mate pair graph distances scale linearly.
Hi-C graphs are scale-free.
Estimation error in probability of variant linkage grows with path length!
Opportunities to collaborate
1. Species clustering of assemblies
2. Resolving strain haplotypes
3. Application to microbial communities!
4. Improvement of the Hi-C protocol
5. Population metagenomics – track gene flow of, e.g. antibiotic
6. Your idea (and some grant funding) here
Manuscript preprint online at https://peerj.com/preprints/260/
A particular slide catching your eye?
Clipping is a handy way to collect important slides you want to go back to later.