GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data

iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of
Biodiversity Collections Program. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily reflect the views of
the National Science Foundation.
Matthew Collins, Alexander Thompson, Jorrit Poelen, Jennifer Hammock

2
What is GUODA?
Global Unified Open Data Access
An informal collaboration between technologists from
organizations like EOL, ePANDDA, and iDigBio as well as
independent biodiversity informaticists. We share data use
cases, best practices, infrastructure, code, and ideas
around the science that can be done by analyzing large
open-access biodiversity datasets.
http://guoda.bio

3
What our members are interested in
Computation with biodiversity data
• Research at scale
• Lowering barriers to accessing computation
• Reproducibility
Matthew Collins: Technical Operations Manager - iDigBio
Jorrit Poelen: Independent
Alexander Thompson: Software Products Lead - iDigBio
Jennifer Hammock: Marine Theme Coordinator - EOL
Nathan Bird: Software Developer - iDigBio

4
An example use of GUODA
Does anyone use catalog numbers in
remarks fields to document relationships
between specimen records in iDigBio?
(We’re at TDWG so we’ve got to do
something with identifiers, right?)

5
A term-document index of iDigBio
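# Assumes `from pyspark.sql import functions as sql`, that idb_df has a
# "note" column holding the remarks text, and that udf_tokenize is a UDF
# returning an array of tokens.
idb_tf_df = (idb_df
    # keep the note text alongside the record uuid
    .select(idb_df["uuid"], idb_df["note"])
    .where(sql.column("note") != "")
    .withColumn("tokens", udf_tokenize(sql.column("note")))
    # one row per (uuid, token) pair; uuid must survive the explode
    .select(sql.column("uuid"),
            sql.explode(sql.column("tokens")).alias("token"))
    .groupBy(sql.column("uuid"), sql.column("token"))
    .count()
)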
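To make the shape of the result concrete, here is a minimal pure-Python sketch of the same kind of term-document count on a few in-memory records. The `tokenize` function, the record identifiers, and two of the three sample notes are illustrative assumptions, not the tokenizer or data used in idb-spark. The real job expresses these steps as Spark DataFrame operations.

```python
import re
from collections import Counter

def tokenize(note):
    """Hypothetical tokenizer: split on whitespace, strip edge punctuation, lowercase."""
    tokens = (re.sub(r"^\W+|\W+$", "", t).lower() for t in note.split())
    return [t for t in tokens if t]

def term_document_counts(records):
    """Return {(uuid, token): count} for every record with a non-empty note."""
    counts = Counter()
    for uuid, note in records:
        if note != "":
            for token in tokenize(note):
                counts[(uuid, token)] += 1
    return counts

# Made-up sample records as (uuid, note) pairs
records = [
    ("rec-1", "Part of Collection at FH: barcode-00374180."),
    ("rec-2", ""),
    ("rec-3", "see also see barcode-00374180"),
]
index = term_document_counts(records)
```

Each key of `index` corresponds to one document:term row, and the value is its count. Records with empty notes contribute nothing.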

6
What terms match catalognumber?
joined = (idb_df_ids
    .join(idb_tf_df,
          on=idb_df_ids["idb_catalognumber"] == idb_tf_df["token"])
    .join(idb_df_notes,
          on=sql.column("uuid") == idb_df_notes["note_uuid"])
    .withColumn("catalognumber_len",
                sql.length(sql.column("idb_catalognumber")))
)
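The first join above is an equality join between catalog numbers and tokens. A small pure-Python sketch of that step, under assumed row layouts, may help show what comes out of it. The function name, the tuple shapes, and the sample rows are hypothetical. The second join, which brings the note text back in, is left out.

```python
from collections import Counter, defaultdict

def match_catalognumbers(id_rows, tf_rows):
    """Inner-join catalog numbers to tokens on exact equality.

    id_rows: iterable of (uuid, catalognumber)
    tf_rows: iterable of (note_uuid, token, count)
    Returns a list of dicts, one per match.
    """
    # Index the term rows by token so each catalog number is one lookup
    by_token = defaultdict(list)
    for note_uuid, token, count in tf_rows:
        by_token[token].append((note_uuid, count))

    matches = []
    for uuid, catalognumber in id_rows:
        for note_uuid, count in by_token.get(catalognumber, []):
            matches.append({
                "idb_uuid": uuid,
                "idb_catalognumber": catalognumber,
                "note_uuid": note_uuid,
                "count": count,
                "catalognumber_len": len(catalognumber),
            })
    return matches

# Made-up sample rows
id_rows = [("rec-A", "barcode-00374180"), ("rec-B", "42")]
tf_rows = [
    ("rec-C", "barcode-00374180", 1),
    ("rec-C", "part", 1),
    ("rec-D", "42", 3),
]
matches = match_catalognumbers(id_rows, tf_rows)

# Histogram of matched catalog number lengths
length_histogram = Counter(m["catalognumber_len"] for m in matches)
```

Keeping the catalog number length on each match makes it possible to separate long, distinctive identifiers from short ones such as "42", which may match ordinary numbers in remarks by coincidence.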

7
What do we find?
A few things, like record bd347847…
which has the remark
“Part of Collection at FH: barcode-00374180.”
and matches record 826da57a...
[Figure: histogram of matching catalognumber length]

8
How long did that take to write?
< 200 lines of code (including whitespace
and comments)
1 intermittent day of coding
https://github.com/iDigBio/idb-spark

9
How long did that take to run?
• 73.5 million records in iDigBio turned into 151 million
document:term:count rows: 40 minutes
• Joined back to iDigBio, resulting in 2.9 billion terms found
in the catalognumber field: 3 hours 40 minutes
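As a rough back-of-the-envelope check, the run times above can be converted to throughput. This is only arithmetic on the rounded figures from this slide. The helper name is hypothetical, and the slides do not state the cluster size, so the rates say nothing about per-node performance.

```python
def per_second(count, hours=0, minutes=0):
    """Average items processed per second over the given wall-clock time."""
    seconds = hours * 3600 + minutes * 60
    return count / seconds

# Indexing step: 73.5 million input records in 40 minutes
index_rate = per_second(73.5e6, minutes=40)

# Join step: 2.9 billion matched terms in 3 hours 40 minutes
join_rate = per_second(2.9e9, hours=3, minutes=40)

print(f"{index_rate:,.0f} records/s")  # 30,625 records/s
print(f"{join_rate:,.0f} matches/s")   # 219,697 matches/s
```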

10
Good tools in the hands of people
with good ideas:
IDEAS → WORK → RESULTS

11
Infrastructure
Servers!
• Mesos
• HDFS
• Spark
• Marathon
• Docker
• Cassandra
Advanced Computing and Information Systems Lab
http://acis.ufl.edu

12
Data is half the tool
Copies of whole datasets
• Stored locally
• Refreshed automatically
Re-represent datasets in a useful structure for
high-performance computing (Parquet on HDFS):
https://github.com/bio-guoda/guoda-datasets
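As an illustration of why Parquet on HDFS is convenient, a dataset stored this way can be opened as a Spark DataFrame in one call. The sketch below assumes an existing PySpark session. The helper name and the path in the usage comment are placeholders, not the actual GUODA dataset locations, which are documented in the repository above.

```python
def load_dataset(spark, path):
    """Read a Parquet dataset into a Spark DataFrame.

    spark: an existing pyspark.sql.SparkSession
    path:  location of the Parquet files (placeholder, see usage below)
    """
    return spark.read.parquet(path)

# Usage inside a PySpark environment (the path is a hypothetical placeholder):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("guoda-example").getOrCreate()
#   idb_df = load_dataset(spark, "hdfs:///path/to/idigbio.parquet")
#   idb_df.printSchema()
```

Because Parquet is a columnar format, a job that touches only a few fields, such as uuid and the remarks text, does not need to read the rest of each record.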

13
Interfaces to GUODA
• Jupyter Notebooks for end-users
• Containers for API and web services
• Persistent storage for application state
• Hangouts calls every 2-4 weeks

14
The front door to GUODA
Notebooks
“Literate Programming”
Comments, code, and outputs all
together in a readable document
that describes what is being done

15
Here’s what it looks like

16
GUODA Jupyter notebook interface

17
What would you do with it?
Have a GitHub account and want to write
code? This is an alpha-quality system.
http://jupyter.idigbio.org
Or talk to us if you want to host an
application on our systems
mcollins@acis.ufl.edu godfoder@acis.ufl.edu

Get involved!
idigbio.org/wiki
facebook.com/iDigBio
twitter.com/iDigBio
vimeo.com/iDigBio
idigbio.org/rss-feed.xml
idigbio.org/events-calendar/export.ics
