Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group

A set of tools for Cariceae informatics

Identify gaps in our knowledge and sampling
Formulate sampling plan
New collections
DNA sequences
DNA matrices
Multiple alignments
Species tree estimates
Revised classification

A central database for specimen-level data

What tools do we need?

• An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
• A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
• A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
• A platform for collaboration: a virtual research environment to bring together researchers worldwide

I. A hierarchical checklist and sampling progress reports

In 2011

• A flat checklist exported from WCM
• A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
• A vague idea of what trips are needed

Today

• A hierarchical checklist by subgenus, section
• A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
• A concrete sampling plan with trips and taxa identified*

* Okay, we’re working on this one!

[Diagram: Taxonomy > Specimen(s) > DNA extraction(s) > Sequence(s) > Trace file(s) / contig(s)]

We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.

Spring 2012: Hierarchical checklist

[Diagram repeated: Taxonomy > Specimen(s) > DNA extraction(s) > Sequence(s) > Trace file(s) / contig(s), with one level marked "!"]

[Diagram, labelled "Metadata flow": Specimen Record > Tissue > Extraction > DNA seq. (three DNA seq. boxes shown)]

A centralized workflow

• Spreadsheets imported into a single Excel file
• Names cleaned (variable)
• DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
• Names matched to our Scratchpads checklist
• All files exported to CSV
• Sample sheets and SP checklist imported to R
• DNA records added to checklist as nodes that are children to their taxa.
• Hierarchical checklist exported in text format, with unsampled taxa marked for searching
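
The last two steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the R code used for the project: the function name, the input layout (section/species pairs and species/voucher pairs), the "[UNSAMPLED]" marker and the placeholder taxon names are all assumptions made for the example.

```python
from collections import defaultdict


def build_checklist(taxa, dna_records, marker="[UNSAMPLED]"):
    """Return the lines of an indented text checklist.

    taxa        -- iterable of (section, species) pairs (assumed layout)
    dna_records -- iterable of (species, voucher) pairs (assumed layout)
    marker      -- searchable tag appended to taxa with no DNA record
    """
    vouchers = defaultdict(list)
    for species, voucher in dna_records:
        vouchers[species].append(voucher)

    by_section = defaultdict(list)
    for section, species in taxa:
        by_section[section].append(species)

    lines = []
    for section in sorted(by_section):
        lines.append(section)
        for species in sorted(by_section[section]):
            found = sorted(vouchers.get(species, []))
            if found:
                lines.append(f"  {species}")
                # DNA records hang below their taxon as child nodes.
                lines.extend(f"    DNA: {v}" for v in found)
            else:
                lines.append(f"  {species}  {marker}")
    return lines


# Placeholder names, not real taxa or vouchers.
example_taxa = [("Section A", "Species a1"), ("Section A", "Species a2"),
                ("Section B", "Species b1")]
example_dna = [("Species a1", "Collector 101"), ("Species a1", "Collector 57")]

if __name__ == "__main__":
    print("\n".join(build_checklist(example_taxa, example_dna)))
```

Searching the exported text for the marker then lists every taxon that still needs sampling.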

← Section name
← Sampled taxon with its DNA vouchers and summaries
← Unsampled taxon

Because Kew has coded geography using TDWG standards, we can export geographic hit-lists
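
A geographic hit-list of this kind could be produced along the following lines. This is only a sketch in Python: the dictionary layout, the function name and the region codes "R1" to "R3" are placeholders invented for the example, not actual TDWG codes or the project's data structures.

```python
def hit_list(distribution, sampled, region):
    """Return the unsampled taxa recorded from one region, sorted by name.

    distribution -- dict mapping taxon name to a set of region codes
    sampled      -- set of taxon names that already have DNA material
    region       -- the region code to build the hit-list for
    """
    return sorted(
        taxon
        for taxon, regions in distribution.items()
        if region in regions and taxon not in sampled
    )


# Placeholder taxa and region codes.
example_distribution = {
    "Taxon a": {"R1", "R2"},
    "Taxon b": {"R2"},
    "Taxon c": {"R3"},
}
example_sampled = {"Taxon a"}

if __name__ == "__main__":
    print(hit_list(example_distribution, example_sampled, "R2"))
```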

[Diagram repeated: Taxonomy > Specimen(s) > DNA extraction(s) > Sequence(s) > Trace file(s) / contig(s), with three levels marked "!" and one marked "?"]

II. A specimen-level phylogenetic pipeline

NCBI is a morass of data.

Geneious

• Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
• Export as
–  FASTA
–  TAB-Delim
–  XML
• XML is the only export that maintains all information in NCBI.
• It is necessary to obtain data that can be used to connect a sequence to a specimen.

Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.

A workflow for specimen-level multigene datasets from NCBI

• Download from NCBI [we used Geneious, but any bulk download is fine]
• Parse out collector name, collector number, isolate number, geography
• Manually clean collector names (3 days for >6500 records)
• Identify specimens by unique combinations of collector name, collector number, isolate
• Toss out “accessions” having more than one scientific name
• Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
• Export datasets to MUSCLE and align; export log file
• Manually check alignments and code logfile (D, RC; variable)
• Rerun MUSCLE and export RAxML batchfile
• Analyze
• Screen for non-monophyly; concatenate and continue!

6692 sequence records in Cariceae

Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens!

However, some NCBI records do contain this data. How do we access it?

NCBI Specimen Record

The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen (for example, some records contain the qualifier specimen_voucher). To get this additional information, we need to export the data as an XML file and parse it into a usable tab-delimited file.
Other good information to export

We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced.
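
The parsing step can be sketched as below. Two assumptions to flag: the example reads NCBI's own GBSeq XML layout (GBFeature / GBQualifier elements) rather than the Geneious export with <qualifiers1> described above, whose schema is not reproduced here; and the sample record, accession and voucher string are invented. The qualifier names themselves (specimen_voucher, isolate, pop_variant, country, lat_lon, collection_date, collected_by) are standard GenBank source qualifiers.

```python
import xml.etree.ElementTree as ET

# Source-feature qualifiers to pull out of each record.
FIELDS = ["specimen_voucher", "isolate", "pop_variant", "country",
          "lat_lon", "collection_date", "collected_by"]

# Invented sample record in GBSeq XML layout.
SAMPLE_XML = """<GBSet><GBSeq>
  <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
  <GBSeq_organism>Carex sp.</GBSeq_organism>
  <GBSeq_feature-table><GBFeature>
    <GBFeature_key>source</GBFeature_key>
    <GBFeature_quals>
      <GBQualifier>
        <GBQualifier_name>specimen_voucher</GBQualifier_name>
        <GBQualifier_value>Collector 1234 (HERB)</GBQualifier_value>
      </GBQualifier>
      <GBQualifier>
        <GBQualifier_name>country</GBQualifier_name>
        <GBQualifier_value>USA</GBQualifier_value>
      </GBQualifier>
    </GBFeature_quals>
  </GBFeature></GBSeq_feature-table>
</GBSeq></GBSet>"""


def parse_records(xml_text):
    """Return one dict per sequence record, keyed by field name."""
    rows = []
    for seq in ET.fromstring(xml_text).iter("GBSeq"):
        row = {
            "accession": seq.findtext("GBSeq_primary-accession", default=""),
            "organism": seq.findtext("GBSeq_organism", default=""),
        }
        for feature in seq.iter("GBFeature"):
            # Voucher-type qualifiers live on the "source" feature.
            if feature.findtext("GBFeature_key") != "source":
                continue
            for qual in feature.iter("GBQualifier"):
                name = qual.findtext("GBQualifier_name")
                if name in FIELDS:
                    row[name] = qual.findtext("GBQualifier_value", default="")
        rows.append(row)
    return rows


def to_tsv(rows):
    """Render the rows as tab-delimited text; missing fields are blank."""
    header = ["accession", "organism"] + FIELDS
    body = ["\t".join(row.get(col, "") for col in header) for row in rows]
    return "\n".join(["\t".join(header)] + body)


if __name__ == "__main__":
    print(to_tsv(parse_records(SAMPLE_XML)))
```

Records that lack a qualifier simply get an empty cell, which makes the gaps in voucher information easy to see in the resulting table.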

To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
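
As an illustration of that rule, here is a small Python sketch. It assumes collector names have already been cleaned down to a last name; the record layout, the normalisation (lower-casing and stripping punctuation) and the sample records are assumptions for the example, not the project's actual code or data.

```python
import re
from collections import defaultdict


def voucher_id(last_name, number, isolate=""):
    """Join last name, collection number and isolate into one key."""
    parts = (re.sub(r"[\W_]+", "", str(p).lower())
             for p in (last_name, number, isolate))
    return "_".join(p for p in parts if p)


def group_by_voucher(records):
    """Map voucher ID -> accessions, keeping single-species vouchers only."""
    names = defaultdict(set)
    accessions = defaultdict(list)
    for rec in records:
        vid = voucher_id(rec["last_name"], rec["number"],
                         rec.get("isolate", ""))
        if not vid:
            continue  # nothing to identify the specimen by
        names[vid].add(rec["organism"])
        accessions[vid].append(rec["accession"])
    # Discard vouchers that carry more than one scientific name.
    return {vid: accs for vid, accs in accessions.items()
            if len(names[vid]) == 1}


# Invented records for illustration.
example_records = [
    {"accession": "A1", "organism": "Species x",
     "last_name": "Collector", "number": "1234"},
    {"accession": "A2", "organism": "Species x",
     "last_name": "COLLECTOR", "number": "1234"},
    {"accession": "B1", "organism": "Species y",
     "last_name": "Other", "number": "10"},
    {"accession": "B2", "organism": "Species z",
     "last_name": "Other", "number": "10"},
]

if __name__ == "__main__":
    print(group_by_voucher(example_records))
```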

6692 sequence records → 3004 individuals, 54 genes, 5846 sequences

ITS, ETS, matK, trnL-trnF

• 3,370 DNA sequences
• 2,196 individuals
• 723 spp
• 397 spp > 1 individual
• 31.7% of those spp monophyletic
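
One way such a monophyly tally could be computed is sketched below. The slides do not say how the screening was actually done, so everything here is an assumption: the tree is taken to be rooted and is written as nested tuples, tip labels are assumed to have the form "species|voucher", and the toy tree is invented. A species with more than one individual counts as monophyletic when some clade contains exactly its tips and nothing else.

```python
from collections import defaultdict


def tips(node):
    """Return all tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(tips(child))
    return found


def clades(node):
    """Yield the tip set of every internal node of the tree."""
    if isinstance(node, str):
        return
    yield frozenset(tips(node))
    for child in node:
        yield from clades(child)


def monophyly_by_species(tree, species_of):
    """Map each species with >1 tip to True/False for monophyly."""
    members = defaultdict(set)
    for tip in tips(tree):
        members[species_of(tip)].add(tip)
    all_clades = set(clades(tree))
    return {sp: frozenset(group) in all_clades
            for sp, group in members.items() if len(group) > 1}


def species_from_label(tip):
    """Assumed tip label format: 'species|voucher'."""
    return tip.split("|")[0]


# Invented toy tree: species a forms a clade, species b does not.
example_tree = ((("a|1", "a|2"), "b|1"), ("b|2", "c|1"))

if __name__ == "__main__":
    result = monophyly_by_species(example_tree, species_from_label)
    share = sum(result.values()) / len(result)
    print(result, f"{share:.1%} monophyletic")
```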

III. Generating maps from specimen data

Carex macloviana D’Urv
GBIF map, 2013-07-06

Mapping GBIF Data

• Generate species list to extract GBIF data (i.e. accepted names in World Checklist).
• Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
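
The idea of wrapping the download call can be sketched as follows. This is a Python illustration only, not the R wrapper around dismo::gbif: `fetch` stands for any function that takes a species name and returns a list of records, and the fake fetcher, its failure mode and the species names are invented so the example runs without network access.

```python
import logging


def download_all(species_list, fetch, logger=None):
    """Call fetch(name) for each species, logging failures and gaps.

    Returns (results, problems): results maps species to its records,
    problems maps species to a short description of what went wrong.
    """
    logger = logger or logging.getLogger("gbif_download")
    results, problems = {}, {}
    for name in species_list:
        try:
            records = fetch(name)
        except Exception as err:  # keep going; one bad name should not stop the run
            problems[name] = f"error: {err}"
            logger.warning("%s: download failed (%s)", name, err)
            continue
        if not records:
            problems[name] = "no records"
            logger.info("%s: no records returned", name)
            continue
        results[name] = records
    return results, problems


def fake_fetch(name):
    """Stand-in for a real download call (hypothetical behaviour)."""
    if name == "Species broken":
        raise TimeoutError("server did not respond")
    if name == "Species empty":
        return []
    return [{"species": name, "lat": "45.1234", "lon": "-88.1234"}]


if __name__ == "__main__":
    ok, bad = download_all(
        ["Species a", "Species broken", "Species empty"], fake_fetch)
    print(sorted(ok), bad)
```

Keeping the problem list alongside the results makes it possible to re-run only the species that failed.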

Clean up downloaded GBIF data

• Flag duplicate specimen datasets
–  Flags specimens within the same species that have identical coordinates.
–  This should be expanded to include specimens that have identical locality descriptions.
• Flag imprecise location data
–  Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
–  This threshold could be adjusted, but is tailored to the Worldclim database we are using (2.5 arc minutes).
• Create a delimited file for each species containing specimen data with flagged columns (a reference file of which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
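
The two flags can be sketched as below. For context on the threshold: 2.5 arc minutes is 2.5/60, about 0.042 degrees, so a coordinate rounded to a tenth of a degree (6 arc minutes) is coarser than one climate grid cell. The code is a Python illustration, not the project's clean_gbif function; it assumes coordinates arrive as text (so decimal places can be counted), treats trailing zeros as carrying no precision, and uses invented records and column names.

```python
def decimal_places(text):
    """Count decimal places in a coordinate string, ignoring trailing zeros."""
    text = text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1].rstrip("0"))


def flag_records(records, max_imprecise_places=1):
    """Return copies of the records with 'duplicate' and 'imprecise' flags.

    duplicate -- same species and numerically identical coordinates as an
                 earlier record (the first occurrence is not flagged)
    imprecise -- latitude given to at most max_imprecise_places decimals
    """
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], float(rec["lat"]), float(rec["lon"]))
        duplicate = key in seen
        seen.add(key)
        imprecise = decimal_places(rec["lat"]) <= max_imprecise_places
        flagged.append({**rec, "duplicate": duplicate, "imprecise": imprecise})
    return flagged


# Invented records for illustration.
example_records = [
    {"species": "Species a", "lat": "45.1", "lon": "-88.25"},
    {"species": "Species a", "lat": "45.1234", "lon": "-88.2500"},
    {"species": "Species a", "lat": "45.1234", "lon": "-88.25"},
]

if __name__ == "__main__":
    for row in flag_records(example_records):
        print(row)
```

Flagging rather than deleting keeps every record in the per-species file, so the decision to exclude a point can be revisited later.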

Example of a file generated from clean_gbif

Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)

• Maps need to be manually checked for accuracy and completeness
• We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon
• Map reviewing is conducted in a dedicated SP2 forum

There are bugs to work out, though

Some taxa are missing data. Example: Carex humilis

• Map of 2331 specimen records from R code download
• Website individual species download
–  Filtered for specimens with coordinate data (= 7209 records)
–  Missing records include some from France, Japan, & South Korea

Some maps will need adjustments: in next iterations, it should be possible to automate some of this

• Carex alata specimen is missing a “-” in longitude column
• Carex lanceolata has specimens where the latitude and longitude are switched.
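
A first pass at automating these two checks might look like the sketch below. It is a Python illustration under stated assumptions: a swap is only detectable this way when the stored latitude falls outside ±90 degrees, and a missing minus sign is only detectable when an expected longitude range for the species is supplied (the `lon_range` argument, a hypothetical input that would have to come from known distribution data). The coordinates in the example are invented.

```python
def check_coordinates(lat, lon, lon_range=None):
    """Return a list of suspected problems for one coordinate pair.

    lat, lon  -- decimal degrees as floats
    lon_range -- optional (west, east) bounds expected for the species
    """
    problems = []
    if abs(lat) > 90 and abs(lon) <= 90:
        # Latitude cannot exceed 90, but the pair is valid if exchanged.
        problems.append("lat/lon probably swapped")
    elif abs(lat) > 90 or abs(lon) > 180:
        problems.append("out of range")
    if lon_range is not None and not problems:
        west, east = lon_range
        inside = west <= lon <= east
        inside_if_negated = west <= -lon <= east
        if not inside and inside_if_negated:
            problems.append("longitude sign probably wrong")
    return problems


if __name__ == "__main__":
    print(check_coordinates(135.0, 35.0))
    print(check_coordinates(40.0, 88.0, lon_range=(-100.0, -60.0)))
    print(check_coordinates(40.0, -88.0, lon_range=(-100.0, -60.0)))
```

Flagged records would still need a manual look before any coordinate is changed.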

In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*.

* See Thursday talk for exciting findings in subgenus Vignea!

We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation.

Code is available at:
https://mor-systematics.googlecode.com/svn/trunk/cariceae

If there is time, I’ll take questions!
