2. Outline
• About Naturalis Biodiversity
Center
• NBC's facilities and expertise
Ancient DNA lab
Barcode lab
Informatics focus group
• Metagenomics and
paleoecology
• Use case: The mammoth's last
meal
• NGS@Naturalis
• Pipeline development
Metagenomics approaches and data analysis 6 February 2013
3. Naturalis Biodiversity Center
• With 37 million specimens,
NBC holds one of the largest
natural history collections in the
world
• More than just a museum, NBC is
an expert center specializing in:
Species identification
Trait harvesting
Impact modeling
Ecological intensification
4. Ancient DNA lab
• The ancient-DNA facility is equipped
for recovering DNA from plant and
animal material from museum
collections and fossils.
• It permits research that would
otherwise not be possible, such as
the study of ancient populations
and museum material.
• The ancient-DNA lab provides an
environment where the risk of
contamination with contemporary
DNA is minimal.
• The facility, a collaboration of IBL,
the Faculty of Archaeology and
NBC, is unique in the Netherlands
6. Informatics focus group
• Exploitation of HPC resources
• Dissemination of best practices
• In-house development of research-
supporting tools:
NGS data processing
Clustering, BLASTing
Custom pipelines
Visualization
Image analysis
Niche modeling
7. HPC infrastructure
• Dell T7500 and T7600 workstations
• Intel® Xeon® Processor
(QuadCore, 2.40GHz) x 2
• 128 GB RAM
• TESLA/NVIDIA GPU
• RedHat/Ubuntu Linux
• Always looking for extra
number-crunching power, e.g. from
NBIC Galaxy, CIPRES, BioPortal,
etc.
8. Paleoenvironments
• Reconstructing the
paleoenvironment is useful for:
Understanding the dynamics
of ecosystem change
Reconstructing pre-
industrialization ecosystems
• Many public policy decision-
makers have pointed to the
importance of using
palaeoecological studies as a
basis for choices made in
conservation ecology
9. Metagenomics
• Taxonomic identification is
one of the main challenges
surrounding metagenomics,
and one of NBC’s core
strengths
• Conversely, a better
understanding of the
metagenome feeds back into
our other research interests
and expertise
• Consequently, a lot of research
activity and ongoing capacity
building
10. Use case: the woolly mammoth's dietary
metagenome
11. The research programme
• To test hypotheses about the structure of the ancient environment
of the woolly mammoth, i.e. productive, continuous grassland
steppe or sparsely covered herb tundra
• Finding frozen mammoths with forensically identifiable food,
parasites, and microorganisms in their gastrointestinal tracts or
feces has the potential of adding data to the extinction debate
• To integrate the findings from ancient DNA with those obtained
from macro- and micro-fossils
13. "Lyuba"
• Discovered in May 2007
• One-month-old mammoth calf
• Age: 41,910 ± 550 YBP
• Very well-nourished, milk-fed
14. The Yukagir mammoth
• Male woolly mammoth
• Discovered in 2002
• Very well preserved in the
permafrost
• Age: 18,560 ± 50 YBP
• Head, front legs, parts of
stomach and intestinal tract
• Last meal still preserved
15. The Cape Blossom mammoth dung
• Produced during the cold season
• Found among a partial skeleton
• Exact site unknown
• Age: 12,300 YBP
16. DNA extraction and sequencing
• In all studies, macro-fossils
(stems, leaves, seeds), micro-
fossils (pollen) and ancient
DNA were compared
• DNA was extracted in the
ancient DNA facility using
multiple extraction protocols
• Several commonly-used
markers were amplified (trnL,
rbcL, nrITS1)
• Sanger sequencing was done
on an ABI 3730xl
17. Data analysis
• Sequences were assembled using
Sequencher
• Taxa were assigned using a
combination of GenBank BLAST
searches and phylogenetic
inference
• BLAST hits were only accepted if
they covered the full query
sequence and differed by at most
1 nucleotide
• Phylogenetic placement was
determined on the basis of
bootstrap support (1000 replicates
using PAUP*)
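The BLAST acceptance rule on this slide can be illustrated with a small filter. This is a minimal sketch, not the code used in the study. It assumes hits are available as tab-separated rows with the columns query id, subject id, query length, query start, query end, mismatches and gaps; that column layout, the `Hit` record and the function names are all assumptions made for illustration. Counting gaps as differences is also an assumption.

```python
from collections import namedtuple

# Hypothetical record layout; field names and column order are assumptions.
Hit = namedtuple("Hit", "query subject qlen qstart qend mismatches gaps")


def parse_hits(lines):
    """Yield Hit records from tab-separated rows, skipping blanks and comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
                  int(f[5]), int(f[6]))


def accept_hit(hit, max_diff=1):
    """True if the hit spans the whole query and differs at most max_diff times."""
    covers_query = hit.qstart == 1 and hit.qend == hit.qlen
    differences = hit.mismatches + hit.gaps  # assumption: gaps count as differences
    return covers_query and differences <= max_diff
```

With this rule, a hit covering positions 1 to 100 of a 100 bp query with one mismatch passes, while a hit starting at position 5, or one with two differences, is rejected.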
18. Findings
• Ancient DNA could assign 7 ("Lyuba"), 12 ("dung") and
8 ("Yukagir") plant families, with several determinations
down to genus level
• Molecules complemented and confirmed fossils
• Identified vegetation composition is generally
supportive of a productive "mammoth steppe"
• Micro-fossils of specific dung fungi showed that
mammoths appear, unlike elephants, to be habitually
coprophagous
19. Next generation applications
• The results of the mammoth research
so far have been obtained using
Sanger sequencing
• Similar, as yet unpublished, research is
being undertaken with the newly
acquired IonTorrent "sequencing by
synthesis" platform
Marcel Eurlings at Naturalis
21. IonTorrent data pre-processing workflow
Filter out short reads
Splice out low Phred scores
Split by primer sequence
Split by adapter sequence
FASTA for downstream analysis
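The workflow steps above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the actual Naturalis workflow: reads are assumed to arrive as (id, sequence, Phred scores) tuples, "splicing out" low scores is interpreted here as keeping the longest run of bases at or above a threshold, and the thresholds, function names and primer sequences are hypothetical placeholders. The same prefix-splitting function could be reused for adapter sequences.

```python
def trim_low_quality(seq, quals, min_phred=20):
    """Keep the longest contiguous run of bases with Phred >= min_phred."""
    best_start, best_len, start = 0, 0, None
    for i, q in enumerate(list(quals) + [-1]):  # sentinel closes the last run
        if q >= min_phred:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return seq[best_start:best_start + best_len]


def split_by_prefix(reads, prefixes):
    """Bin (id, seq) reads by the primer or adapter they start with, removing it."""
    bins = {name: [] for name in prefixes}
    for read_id, seq in reads:
        for name, prefix in prefixes.items():
            if seq.startswith(prefix):
                bins[name].append((read_id, seq[len(prefix):]))
                break  # reads matching no prefix are dropped
    return bins


def preprocess(records, primers, min_len=50, min_phred=20):
    """Drop short reads, trim low-quality bases, then split by primer."""
    kept = []
    for read_id, seq, quals in records:
        if len(seq) < min_len:
            continue
        kept.append((read_id, trim_low_quality(seq, quals, min_phred)))
    return split_by_prefix(kept, primers)


def to_fasta(reads):
    """Format (id, seq) pairs as FASTA text."""
    return "".join(">{}\n{}\n".format(rid, seq) for rid, seq in reads)
```

Each bin returned by `preprocess` can then be written out with `to_fasta` for downstream analysis.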
22. Taxonomic identification pipeline
• Taxonomic identification of the
contents of samples is a generic
problem for which we have
developed a re-usable pipeline
• It replicates some of the
functionality of QIIME but integrates
more conveniently in our HPC
configuration
• Requirements:
Python 2.7 or 3.2
Biopython 1.58
NCBI-Blast-2.2.25+
Clustering programs, e.g. TGICL,
Usearch, Octupus, cd-hit
23. Pipeline steps
Optional: tag FASTA for provenance retracing across files
Cluster sequences into OCTUs of at least 10 reads
Pick exemplar sequence (random, consensus or hybrid)
BLAST exemplar sequences (local or remote)
Optional: retrace provenance
Report
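Two of the steps above, the minimum cluster size and the choice of an exemplar sequence, can be illustrated in a few lines. This is a minimal sketch and not the pipeline's own code: it assumes clustering has already been done by an external program and that clusters are available as a mapping from cluster id to a list of sequences. The function names are hypothetical, the consensus shown is a simple column-wise majority over equal-length sequences, and the "hybrid" option is left out because the slides do not define it.

```python
import random
from collections import Counter


def filter_clusters(clusters, min_reads=10):
    """Keep only clusters that contain at least min_reads sequences."""
    return {cid: seqs for cid, seqs in clusters.items()
            if len(seqs) >= min_reads}


def consensus(seqs):
    """Column-wise majority consensus; assumes aligned, equal-length sequences."""
    if len(set(len(s) for s in seqs)) != 1:
        raise ValueError("sequences must have equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def pick_exemplar(seqs, method="random", rng=None):
    """Pick one representative sequence for a cluster."""
    if method == "random":
        return (rng or random).choice(seqs)
    if method == "consensus":
        return consensus(seqs)
    raise ValueError("unknown method: {}".format(method))
```

The exemplars picked this way would then be the sequences submitted to BLAST, so that each cluster costs one search rather than one per read.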
24. Pipeline extensions
• NBC frequently deals with samples
that may contain materials from
endangered species, for example:
Putative FSC wood
Traditional Chinese medicine
Incense
• We are therefore extending the
taxonomic identification pipeline to
check automatically whether any taxa
from the sample are listed in CITES
appendices
• This, however, poses additional
challenges of taxonomic name
reconciliation
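The name reconciliation problem can be illustrated with a toy check. This is only a sketch of the idea, not the extension under development: it assumes the listed names and a synonym table are already available locally, reduces every name to a lower-case "genus species" key, and resolves synonyms before looking a name up. All function names and all taxon names in the usage example are made-up placeholders, not real CITES entries.

```python
def normalize(name):
    """Reduce a name to lower-case 'genus species', dropping authorship."""
    parts = name.replace("_", " ").split()
    return " ".join(parts[:2]).lower()


def flag_listed(sample_taxa, listed, synonyms=None):
    """Map each sample taxon that matches a listed name to that listed name.

    synonyms maps a normalized synonym to its accepted name (hypothetical table).
    """
    synonyms = synonyms or {}
    listed_by_key = {normalize(n): n for n in listed}
    flagged = {}
    for taxon in sample_taxa:
        key = normalize(taxon)
        key = normalize(synonyms.get(key, key))  # resolve synonym, if any
        if key in listed_by_key:
            flagged[taxon] = listed_by_key[key]
    return flagged


# Usage with placeholder names only:
listed = ["Examplea rara", "Fictus protectus"]
synonyms = {"oldname rara": "Examplea rara"}
sample = ["Examplea_rara L.", "Oldname rara", "Commonia vulgaris"]
flagged = flag_listed(sample, listed, synonyms)
```

Exact matching on normalized strings misses spelling variants and names that differ only below species rank, which is part of why reconciliation is a challenge in practice.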
25. Other metagenomics work
• Phylogenies from
metagenomic sequence data
can grow to immense sizes
• For example, the GreenGenes
16S rRNA tree has ~400k tips
• We are developing novel
algorithms for pruning these
trees using (Google’s)
MapReduce programming
model
26. Acknowledgements
• I am grateful to:
• Dr. Barbara Gravendeel for her input in
developing this talk
• Youri Lammers for his great work in developing
a well-documented taxonomic identification
pipeline
• And to NBIC for giving me the opportunity to
present this story