Our project also aims to use the GloBI APIs to visualize understudied organisms and locations with minimal interaction data in the GloBI data repository.
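As a rough sketch of how such a visualization pipeline might start, the snippet below builds a GloBI interaction query URL and counts records in a GloBI-style tabular JSON response. The endpoint path and parameter names follow GloBI's public `/interaction` route but should be treated as assumptions, and the sample response is canned rather than fetched live.

```python
import json
from urllib.parse import urlencode

# Base URL of the public GloBI web API; the endpoint and parameter names
# below are assumptions modeled on GloBI's documented /interaction route.
GLOBI_API = "https://api.globalbioticinteractions.org/interaction"

def build_interaction_query(source_taxon, interaction_type="interactsWith"):
    """Build a GloBI interaction query URL for one source taxon."""
    params = {"sourceTaxon": source_taxon, "interactionType": interaction_type}
    return GLOBI_API + "?" + urlencode(params)

def count_interactions(response_text):
    """Count interaction records in a GloBI-style JSON response.

    GloBI returns tabular JSON with 'columns' and 'data' keys; the sample
    below mimics that shape rather than a live response.
    """
    payload = json.loads(response_text)
    return len(payload.get("data", []))

# Canned example response (invented records, GloBI-like shape).
sample = ('{"columns": ["source", "interaction", "target"], '
          '"data": [["Enhydra lutris", "eats", "Strongylocentrotus"], '
          '["Enhydra lutris", "eats", "Haliotis"]]}')
url = build_interaction_query("Enhydra lutris", "eats")
```

A taxon with few rows in such a response is exactly the kind of understudied organism the project aims to surface.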
1. The document discusses issues with agricultural information systems like different user needs, multiple data sources, and lack of interoperability.
2. It proposes using shared vocabularies, ontologies, and application profiles like AGRIS AP and AgMES to enable semantic interoperability across systems through a common exchange layer.
3. The Agricultural Ontology Service aims to improve semantic search and access to agricultural knowledge resources by providing a registry and federated storage for vocabularies, ontologies, and other knowledge organization systems like AGROVOC.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
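The convergence property at the heart of the CRDT approach can be illustrated with a toy state-based grow-only set (G-Set): replicas apply adds locally and merge by set union, so merges commute and every replica reaches the same state regardless of delivery order. This is a minimal illustration of the CRDT idea, not the specific data type the talk proposes for semantic stores.

```python
# Toy grow-only set (G-Set) CRDT sketch: merge is set union, which is
# commutative, associative, and idempotent, so replicas converge no
# matter in which order they exchange states.

class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, value):          # local update at one replica
        self.elements.add(value)

    def merge(self, other):        # state-based merge via union
        merged = GSet()
        merged.elements = self.elements | other.elements
        return merged

# Two replicas diverge, then exchange states in either order.
a, b = GSet(), GSet()
a.add("triple-1")
b.add("triple-2")
ab = a.merge(b)
ba = b.merge(a)
```

Because `ab` and `ba` are identical, optimistic replication never needs coordination to converge, which is what makes a writable, massively replicated linked-data cloud plausible.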
BHL is a digital library that provides open access to biodiversity literature. It contains over 33,000 volumes and 13.3 million pages that are digitized from partner institutions. Usage is growing, with 45,000 unique users and 250,000 page views per month. BHL faces challenges in managing the distributed digitized content from partners and improving technologies like OCR and name recognition. It provides open APIs and data to enable discovery and sharing of content.
This document discusses Neo4j and its applications in bioinformatics. It describes Bio4j, an open source bioinformatics graph database built using Neo4j that integrates data from sources like Uniprot, NCBI taxonomy, Gene Ontology, and more. Bio4j models biological data as nodes and relationships in a graph structure rather than tables. This allows for more flexible querying and knowledge integration. The document provides examples of how Bio4j can be accessed through its Java API, Cypher query language, Gremlin traversal language, and REST API. It also describes some tools and visualizations for exploring and analyzing Bio4j data.
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an... (NextMove Software)
The document summarizes a presentation about using large graph databases for chemical similarity searching. It describes building a graph database of 68 billion molecular substructures from 340 million molecules and using graph edit distance to perform sublinear-scaling searches through the database to identify similar molecules. This approach scales better to large datasets than traditional fingerprint-based similarity methods.
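For contrast with the graph approach, the traditional fingerprint baseline mentioned above can be sketched in a few lines: each molecule becomes a set of substructure keys and pairs are compared with the Tanimoto (Jaccard) coefficient. The fragment names here are invented purely for illustration.

```python
# Fingerprint-style similarity baseline: molecules as sets of
# substructure keys, compared with the Tanimoto coefficient.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two substructure-key sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Made-up substructure keys for two related molecules.
aspirin = {"benzene", "ester", "carboxylic_acid"}
salicylic_acid = {"benzene", "hydroxyl", "carboxylic_acid"}
score = tanimoto(aspirin, salicylic_acid)
```

A linear scan with this function touches every molecule, which is the scaling limitation the substructure-graph index is designed to avoid.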
Learning Multilingual Semantics from Big Data on the Web (Gerard de Melo)
This document summarizes Gerard de Melo's presentation on learning multilingual semantics from big data on the web. It discusses how lexical and taxonomic knowledge can be extracted at large scale from online resources like Wiktionary, Wikipedia, and WordNet. Methods are presented for merging structured data like knowledge graphs and integrating taxonomies across languages using techniques like linear program relaxation and belief propagation. The goal is to build large yet reasonably clean multilingual knowledge bases to power applications in areas like semantic search and the digital humanities.
The document summarizes an open genomic data project called OpenFlyData that links and integrates gene expression data from multiple sources using semantic web technologies. It describes how RDF and SPARQL are used to query linked data from sources like FlyBase, BDGP and FlyTED. It also discusses applications built on top of the linked data as well as performance and challenges of the system.
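The kind of triple-pattern query a SPARQL engine runs over such linked data can be mimicked with a tiny in-memory matcher. The triples and predicate names below are invented for illustration; OpenFlyData itself queried real SPARQL endpoints over FlyBase, BDGP and FlyTED data.

```python
# Minimal in-memory stand-in for SPARQL-style triple-pattern matching.
# Identifiers are hypothetical, loosely styled after FlyBase/BDGP IRIs.

triples = [
    ("flybase:FBgn0000490", "rdfs:label", "dpp"),
    ("flybase:FBgn0000490", "ex:expressedIn", "bdgp:embryo_stage_9"),
    ("flybase:FBgn0003731", "rdfs:label", "egfr"),
]

def match(pattern, store):
    """Return triples matching a pattern; None acts as a SPARQL variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?s ?o WHERE { ?s rdfs:label ?o }
labels = match((None, "rdfs:label", None), triples)
```

Federating the same pattern over several stores is, in essence, what linking FlyBase, BDGP and FlyTED through SPARQL achieves.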
Araport is an online resource for Arabidopsis and plant research that integrates various types of data from different sources. It provides genome annotation for Arabidopsis that has been validated and updated using RNA-seq data. Data is stored and can be accessed through the ThaleMine data warehouse. Araport also features a JBrowse genome viewer and Science Apps that retrieve real-time data through web services. It is an open source project that welcomes community contributions and holds workshops to support developers.
NHM Data Portal: first steps toward the Graph-of-Life (Edward Baker)
This document summarizes a presentation about the Natural History Museum's (NHM) efforts to create a centralized data portal and move towards a "Graph of Life" by connecting their collected data. It describes the large number and variety of objects in the NHM collections, efforts to digitize specimens, and challenges with previous disconnected digital access systems. The new NHM Data Portal aims to make data discovery and access easier through an open-source CKAN platform, providing over 3.7 million records and APIs. It discusses using the portal and linked open data approaches to ask new questions across datasets, provide metrics on data quality and use, and integrate with external aggregators.
The document proposes a solution called iBioSearch that aims to provide a unified search interface for biologists to search over 1000 biological databases. It does this by first collecting interfaces from various biological databases and then reverse engineering them to generate a global schema (metamodel) that represents common search capabilities across interfaces. It maps each interface as an instance of this metamodel to extract search entities and criteria. It then clusters entities and consolidates criteria to generate a non-redundant global biological search interface (GBWS) for biologists. Future work involves testing this approach with biologists and expanding the methodology.
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March... (Dag Endresen)
Regional NODES meeting of Europe 2010. Presentation of the Global Biodiversity Resources Discovery System (GBRDS, under development) for the NODES. How do we, the NODES, want the GBRDS to look? What do we, the NODES, wish or need the GBRDS to be?
http://www.gbif.org/
http://gbrds.gbif.org/
http://code.google.com/p/gbif-registry/
The document discusses the CIARD (Coherence in Information for Agricultural Research for Development) initiative and how it aims to create a global infrastructure for linked open data. It describes how FAO has worked for decades to make agricultural information more accessible, including through programs like AGRIS and AIMS. The CIARD initiative now involves over 100 partners working to coordinate their efforts and promote common data formats and systems. It outlines FAO's work on vocabularies like AGROVOC and how linked open data can help link distributed data sources in agriculture through applying standards.
This document discusses developing an ontology-based semantic web application for the biological domain. It introduces the need for semantic technologies to help machines better understand and combine biological information from different sources. The document outlines the methodology, which involves defining concepts, properties, and relations in the biological domain to create an ontology. It also discusses implementing a semantic web application using the Jena framework to retrieve and manipulate biological data modeled with ontologies and RDF. The goal is to build a semantic search framework to improve information retrieval for biologists.
Publication and dissemination of datasets in taxonomy: ZooKeys working example
Lyubomir Penev, Terry Erwin, Jeremy Miller, Vishwas Chavan, Tom Moritz, Charles Griswold. ZooKeys 11: 1-8 (2009)
doi: 10.3897/zookeys.11.210
IBC FAIR Data Prototype Implementation slideshow (Mark Wilkinson)
Discussion about ways of achieving FAIRness of both metadata and data. Brute force approaches, and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Penev, L et al. Publ Dissem Data Zookeys 06 01 09 (Tom Moritz)
This document describes a concept for publishing and disseminating datasets in taxonomy that was applied in a paper by Miller et al. published in ZooKeys. Key aspects of the concept include: (1) publishing primary biodiversity data from the paper as a separate dataset with a DOI, (2) making the occurrence dataset available through GBIF simultaneously with publication, and (3) publishing the occurrence dataset as an interactive KML file in Google Earth with a separate DOI. This allows for indexing, aggregation, and reuse of the published data.
Scratchpads are virtual research environments that allow taxonomic and biodiversity data to be collected, curated, analyzed, published, and shared in a digital, open, and linked manner. They provide a seamless workflow for data by hosting websites for communities to enter and structure data using standardized modules. This facilitates dissemination of research through open access publishing of datasets, descriptions, keys, and more without reformatting. Major projects like e-Monocot demonstrate Scratchpads' ability to aggregate data from various sources into an integrated portal.
Text (a personal-views position statement) to accompany a presentation on what research infrastructures really need for data. XLDB-Europe, 8-10 June 2011, Edinburgh.
2 Discovery and Acquisition of Data1.pptx (vijayapraba1)
This document provides an outline of Lecture 2 from the course GEO 802, Data Information Literacy. It discusses various portals and repositories for publishing and finding data, including discipline-specific repositories, as well as directories and indexes of repositories. It also covers data journals and venues for publishing datasets to get them cited. Finally, it lists some exercises for students to find relevant data repositories in their fields and to explore search tools and open data portals.
The IP LodB project (for more details see iplod.io) capitalizes on LOD database thinking to build bridges between patented information and scientific knowledge, focusing on the individuals who codify new knowledge and their connected organizations, including those who apply patents in new products and services.
As its main outputs, IP LodB produced an intellectual property rights (IPR) linked open data (LOD) map (the IP LOD map) and tested the linkability of the European patent (EP) LOD database, while increasing the uniqueness of the data using different harmonization techniques.
These slides were developed for a NIPO workshop.
Data integration in a Hadoop-based data lake: A bioinformatics case (IJDKP)
When working in a data lake, data integration is not easy, mainly because the data is usually stored in raw format. Performing data integration manually is a time-consuming task that requires the supervision of a specialist, who can make mistakes or fail to see the optimal integration point among two or more datasets. This paper presents a model for heterogeneous in-memory data integration in a Hadoop-based data lake using a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration points. The integration algorithm is based on the Overlap coefficient, since it presented better results than the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We tested the model by applying it to eight bioinformatics-domain datasets. The model produces better results than a specialist's analysis, and we expect it can be reused for datasets from other domains.
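The four set-similarity measures the paper compares are easy to state side by side. The snippet below implements each one and uses invented column-value sets to show why the Overlap coefficient stands out: it scores a perfect 1.0 when one column's values are a subset of another's, a common pattern between join-key columns.

```python
# The set-similarity measures compared in the paper; the Overlap
# coefficient is the one its integration algorithm settled on.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def tversky(a, b, alpha=0.5, beta=0.5):
    i = len(a & b)
    return i / (i + alpha * len(a - b) + beta * len(b - a)) if a or b else 1.0

# Invented column values: sample_ids is a subset of gene_ids, so Overlap
# flags them as a perfect integration point while Jaccard/Dice do not.
gene_ids = {"g1", "g2", "g3", "g4"}
sample_ids = {"g1", "g2"}
```

With alpha = beta = 0.5, Tversky reduces to the Sørensen-Dice coefficient, which is why the paper treats them as distinct parameterizations rather than unrelated metrics.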
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi... (James Nelson)
The document describes a machine learning project that compares the performance of R packages for logistic regression and random forest algorithms on wine quality datasets. It loads and prepares the datasets, then explores the data through descriptive statistics. Logistic regression and random forest models are applied to the training data and evaluated on test data.
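The project itself used R packages, but the train-then-evaluate loop it describes can be sketched in pure Python with a minimal gradient-descent logistic regression. The single feature and binary labels below are made up for illustration and are not the actual wine-quality data.

```python
import math

# Minimal logistic-regression sketch of a train/evaluate loop, using
# gradient descent on a toy, invented single-feature dataset.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=1.0, steps=2000):
    """Fit weight and bias by stochastic gradient descent on log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x   # gradient of log-loss w.r.t. w
            b -= lr * (p - y)       # gradient of log-loss w.r.t. b
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) > 0.5 else 0

xs = [0.0, 0.2, 0.8, 1.0]   # stand-in scaled feature (e.g. alcohol level)
ys = [0, 0, 1, 1]           # stand-in binary quality label
w, b = fit(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(ys)
```

The R workflow in the project adds the pieces omitted here: a held-out test split, a random-forest comparison model, and proper metrics beyond raw accuracy.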
James Nelson has over 15 years of experience designing and leading laboratory, translational, and clinical research studies. He has a PhD in Molecular Biology and Genetics from Wayne State University and is currently enrolled in Indiana University's Data Science Master's Program. Nelson has extensive experience in areas such as statistical analysis, machine learning, big data technologies, bioinformatics, and data visualization. He has authored over 70 peer-reviewed publications and has been the recipient of over $7 million in NIH research grants.
Similar to IU Data Visualization Class Final Project: Visualizing Missing Species Interactions
This proposal outlines the commercialization pathway for an investigational in vitro diagnostic (IVD) device for nonalcoholic fatty liver disease (NAFLD). The investigators were unable to identify a substantially equivalent predicate device, so they plan to submit a formal pre-submission to the FDA to obtain guidance on the appropriate regulatory pathway. The studies funded by this proposal would support the information needed for the pre-submission, including analytical validation and performance characteristics of the test. Depending on FDA feedback, the pathway may involve de novo classification, reclassification, or premarket approval.
1) A study examined the effects of a high-fat diet and parenteral iron administration on non-alcoholic fatty liver disease (NAFLD) in an obese, diabetic mouse model. 2) Mice fed a high-fat diet and administered parenteral iron showed increased liver inflammation, oxidative stress, and collagen production compared to mice on only a high-fat diet or normal diet. 3) However, mice given both a high-fat diet and parenteral iron showed less fat accumulation in the liver (steatosis) than mice on only a high-fat diet.
This document summarizes a study that will compare the effects of omega-3 polyunsaturated fatty acid supplementation to monounsaturated fatty acid supplementation for 8 weeks on nonalcoholic fatty liver disease (NAFLD). It will randomize 30 patients with NAFLD and at least 20% steatosis into the two treatment groups. The primary outcome is reduction of intrahepatic fat content as measured by magnetic resonance spectroscopy. Secondary outcomes include changes in liver enzymes, lipid profile, inflammation markers, and insulin resistance. The study personnel, design, population, visit schedule, and treatment protocols are outlined.
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ... (James Nelson)
The aim of this study is to investigate the effects of an 8-week dietary supplementation with omega-3 polyunsaturated fatty acids (PUFA; i.e., fish oil) compared to monounsaturated fatty acids (MUFA; i.e., safflower oil) on intrahepatic fat content measured by magnetic resonance spectroscopy, serum aminotransferases, fasting lipids, insulin resistance, resting metabolic rate and proinflammatory cytokines in patients with non-alcoholic fatty liver disease.
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ... (James Nelson)
The goal of this study was to investigate if IL6 and IL1β cytokine SNPs, alone or in combination with HFE gene mutations, can affect the grade and pattern of hepatic iron deposition and serum iron markers in the well characterized NASH CRN cohort.
Serum Vitamin D Deficiency is Associated with NASH in Adults (James Nelson)
The aim of this study was to determine the relationship of serum vitamin D levels to histologic features of NAFLD, and associated demographic, clinical, and laboratory data in the well characterized NASH CRN cohort.
Twitter Dataset Analysis and Geocoding (James Nelson)
The aim of the project was to validate user-defined location data in a Twitter dataset of 10,000 tweets using MongoDB and the Google Maps Geocoding API.
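The geocoding request sent for each user-defined location field can be sketched without touching the network by building the Google Maps Geocoding API URL directly. The endpoint is the documented public one; `YOUR_API_KEY` is a placeholder, and no request is actually issued here.

```python
from urllib.parse import urlencode

# Build the Google Maps Geocoding API request URL for one free-text
# location string from a tweet. "YOUR_API_KEY" is a placeholder.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(location, api_key="YOUR_API_KEY"):
    """Return the geocoding request URL for a user-defined location field."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": location,
                                               "key": api_key})

url = geocode_url("Bloomington, IN")
```

In the project, the JSON response for each such URL would be compared against the tweet's stated location and stored in MongoDB for validation.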
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA... (James Nelson)
Next-generation RNA sequencing has expedited the identification of new non-coding RNA species (ncRNAs), thus ushering in the emerging field of ncRNA biology. The goals of this study were to catalogue the spectrum of different ncRNAs in serum and liver of patients with NAFLD and to compare expression of serum exRNAs between NAFLD patients and healthy control subjects.
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease (James Nelson)
Next-generation sequencing (NGS) was performed on 45 serum RNA samples using the Illumina HiScanSQ platform. The goal of this study was to determine serum miRNA profiles for use as novel diagnostic and prognostic biomarkers for the presence of NAFLD, NASH and advanced fibrosis.
This curriculum vitae summarizes the education and experience of James E. Nelson. He received a PhD in Molecular Biology and Genetics from Wayne State University in 1994. Since then, he has held several research and staff positions, primarily focused on nonalcoholic steatohepatitis (NASH). He has received over $5 million in grant funding and authored over 50 publications. He has also designed and conducted numerous clinical studies on NASH through the NASH Clinical Research Network.
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
IU Data Visualization Class Final Project: Visualizing Missing Species Interactions
Team: Jim Nelson, Deepak Kher, Rama Raghava Reddy, Al Armstrong
INDIANA UNIVERSITY BLOOMINGTON
Visualizing Missing Species Interactions Data
Client Project – Information Visualization
Visualizing Missing Species Interactions Data
Final Project IU IVMOOC 2016
I. Project Title – Visualizing Missing Species Interactions Data
II. Visualization Title – Global Visualization of Missing Species Interactions
III. Team
Jim Nelson
Deepak Kher
Rama Raghava Reddy
Al Armstrong
IV. Visualization Goals & Importance of Project
Visualization Goals and Prototype
The aim of our project is to utilize the GloBI APIs to visualize understudied organisms and
locations with minimal interaction data within the GloBI data repository. Please see the
snapshots below for the expected visualization product from our project.
Importance of Project
The human population is continually growing and encroaching upon traditional wildlife habitat.
At the same time over fishing, pollution and global warming are threatening marine ecosystems.
If worldwide conservation efforts are to succeed it is imperative that we fully understand the
interactions of biological networks across the globe. We hope that our project could become an
important research tool to define knowledge gaps within the hierarchy of interactions among
species worldwide.
V. Related Work
Global Biotic Interactions (GloBI) is an open, interactive and integrated species interaction data
service (1). The goal of GloBI is to provide an infrastructure to catalog all known interactions
among existing species. GloBI provides a means for researchers to combine their biotic
datasets using automated tools that normalize, aggregate and integrate various datasets into
structured repositories (a Neo4j database) using standardized vocabularies and ontologies (2).
Currently, GloBI has cataloged nearly 1.4 million species interactions among 149,676 different
taxa gleaned from over 18,000 studies (3).
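Building a query against the GloBI web API (16) can be sketched as below. The endpoint and parameter names follow the API wiki, but they are assumptions on our part and should be verified there.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are assumptions based on the GloBI API wiki (16);
# check the wiki for the currently supported query parameters.
BASE = "https://api.globalbioticinteractions.org/interaction"

def interaction_url(source_taxon, interaction_type="interactsWith"):
    """Build a query URL for recorded interactions of a source taxon."""
    return BASE + "?" + urlencode({
        "sourceTaxon": source_taxon,
        "interactionType": interaction_type,
    })

# Example: interactions in which sea otters are the consumer.
print(interaction_url("Enhydra lutris", "eats"))
```

Fetching the resulting URL (e.g., with urllib.request) returns JSON describing the matching interactions.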
As shown in the figure below (4), GloBI is part of a network of related organizations, websites
and other data providers working to catalog and provide access to biological data. Other web
services that integrate directly with the GloBI data include the Encyclopedia of Life (EOL),
sponsored by the Smithsonian Institution, and the Gulf of Mexico Species Interactions
(GoMexSI) (5, 6). In addition, a number of published studies have utilized and cited the GloBI
datasets (7).
For the last two years GloBI has served as one of the IVMOOC client projects. In 2014, the
IVMOOC team created a food-web map by overlaying the GloBI interaction data with terrestrial
and marine ecoregion geospatial data (4, 8). To create the visualization the team utilized
several R packages along with Cytoscape and Adobe Illustrator. Last year the IVMOOC team
created the "GloBI Explorer", an interactive web app geared toward middle and high school
students (4, 9-10). Using the GloBI APIs, species thumbnail photos and simplified network
visualizations, the team created what should be a very effective educational resource for
getting students interested in biology and ecology.
VI. Data Statistics
Overview
The data to be used in this project was available in three formats in the GloBI GitHub repository
(2): Darwin Core (csv format) (11), Turtle (RDF format) (12) and Neo4j (graph database format)
(13). The data can also be accessed using software libraries (R and JavaScript) (14-15) or by
accessing the API directly (16). The datasets are recreated, normalized, integrated and
exported to the various data archives, such as a Neo4j graph database, Darwin Core archive and
RDF/Turtle archive, using Maven (17) as shown in the diagram below (2).
GloBI data normalization routine
We chose to utilize the data available as csv files in the Darwin Core Archive format, the
standard for biodiversity informatics data such as this (11). Six separate csv files were
downloaded and extracted from a single tarball file in the GloBI GitHub repository. Here is a
summary of the main variables in each file:
occurrence.csv
  occurrenceID = unique ID for each of the 1.4 million organism interactions
  taxonID = organism ID
  decimalLatitude
  decimalLongitude
association.csv
  occurrenceID = as above
  associationID = type of interaction (e.g., predator/prey, parasitic, pathogenic)
  additional fields that may or may not be needed
taxa.csv
  taxonID
  furtherInformationURL = web link to more information
  scientificName = Latin name
reference.csv
  table showing authors and study citations for datasets
measurementOrFact.csv
  table containing data related to different physical measurements obtained for each taxon
taxonCache.csv
  table containing phylogenetic hierarchy data (a.k.a. the tree of life), including scientific
  and common names
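The download-and-extract step described above can be sketched with Python's standard library. The archive URL below is a placeholder (the real location is in the GloBI GitHub repository (2)), and only the csv members are pulled out.

```python
import tarfile
import urllib.request

# Hypothetical archive location; the real tarball lives in the GloBI GitHub repository (2).
ARCHIVE_URL = "https://example.org/globi-dwca.tar.gz"

def fetch_and_extract(url, tarball_path, dest):
    """Download the Darwin Core tarball and extract its csv members into dest."""
    urllib.request.urlretrieve(url, tarball_path)
    with tarfile.open(tarball_path, "r:gz") as tar:
        # Only extract the csv files used in this project.
        members = [m for m in tar.getmembers() if m.name.endswith(".csv")]
        tar.extractall(dest, members=members)
    return [m.name for m in members]
```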
Data Extraction, Integration and Cleaning
We utilized a multi-pronged approach to extract, integrate and clean the datasets. Initially the
occurrence, association and taxa datasets were loaded into R, and using the R packages dplyr
(18) and tidyr (19) the occurrence.csv, association.csv and taxa.csv files were joined on the
occurrenceID and taxonID variables. Due to the non-uniform nature of many of the Darwin Core
variables, such as occurrenceID, which combined multiple ID formats from the original databases
(i.e., Encyclopedia of Life (EOL) (5), Global Biodiversity Information Facility (GBIF) (20) and
Integrated Digitized Biocollections (iDigBio) (21)), utilizing R for data cleaning became time
consuming and problematic. Ultimately these steps were performed using SQL. First the above csv
files were loaded into a SQL database. The data of interest was then extracted from the SQL
database using the following three custom Python scripts:
1. Taxon and Occurrence.py
For each taxonID in the occurrence file, all the occurrences were extracted and stored in a
JSON file. Here the key is taxonID and the value is a list of [occurrenceID, decimalLatitude,
decimalLongitude].
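A minimal sketch of this step, with an in-memory list of dicts standing in for the SQL query result (the row values are invented for illustration):

```python
import json
from collections import defaultdict

# Invented rows standing in for the SQL query over occurrence.csv.
rows = [
    {"taxonID": "tax1", "occurrenceID": "occ1", "decimalLatitude": 39.2, "decimalLongitude": -86.5},
    {"taxonID": "tax1", "occurrenceID": "occ2", "decimalLatitude": 40.0, "decimalLongitude": -87.0},
    {"taxonID": "tax2", "occurrenceID": "occ3", "decimalLatitude": -1.3, "decimalLongitude": 36.8},
]

# Key: taxonID; value: list of [occurrenceID, decimalLatitude, decimalLongitude].
by_taxon = defaultdict(list)
for r in rows:
    by_taxon[r["taxonID"]].append(
        [r["occurrenceID"], r["decimalLatitude"], r["decimalLongitude"]])

with open("taxon_occurrences.json", "w") as f:
    json.dump(by_taxon, f)
```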
2. Occurrence and Association.py
For each occurrenceID stored in the above JSON file, all relevant data from the association.csv
file was retrieved, such as associationID, targetOccurrenceID, association type and referenceID.
If there are association details for an occurrence they are stored; if the occurrenceID is
missing from association.csv, a null value is stored instead. These data were then exported to
another JSON file. Here the key is occurrenceID and the value is a list of [associationID,
targetOccurrenceID, referenceID, associationType].
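Under the same kind of toy data, this lookup might look as follows; None stands in for the null stored when an occurrence has no association record:

```python
import json

# Invented association rows keyed by occurrenceID.
associations = {
    "occ1": ["assoc1", "occ9", "ref1", "preysOn"],
}

occurrence_ids = ["occ1", "occ2"]

# Key: occurrenceID; value: [associationID, targetOccurrenceID, referenceID,
# associationType], or None when no association exists for that occurrence.
by_occurrence = {oid: associations.get(oid) for oid in occurrence_ids}

with open("occurrence_associations.json", "w") as f:
    json.dump(by_occurrence, f)
```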
3. Final.py
The two JSON files created by the above two scripts were merged, and the final data was stored
in final.csv.
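The merge can be sketched as a walk over the first mapping, pulling association details (or blanks) from the second; the column names and toy values here are assumptions:

```python
import csv

# Toy outputs of the two previous steps.
by_taxon = {"tax1": [["occ1", 39.2, -86.5], ["occ2", 40.0, -87.0]]}
by_occurrence = {"occ1": ["assoc1", "occ9", "ref1", "preysOn"], "occ2": None}

with open("final.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["taxonID", "occurrenceID", "decimalLatitude", "decimalLongitude",
                     "associationID", "targetOccurrenceID", "referenceID", "associationType"])
    for taxon_id, occurrences in by_taxon.items():
        for occ_id, lat, lon in occurrences:
            # Blank association fields mark occurrences with no recorded interaction.
            assoc = by_occurrence.get(occ_id) or ["", "", "", ""]
            writer.writerow([taxon_id, occ_id, lat, lon] + assoc)
```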
Dataset Description

Total number of occurrences (interactions):                           1,048,575
Number of unique occurrences:                                           700,683
Number of unique taxa (different organisms):                            108,345
Number of taxa accounting for the total number of unique occurrences:    55,741
Percentage of taxa without interaction data:                                51%
Number of occurrences representing taxa with multiple interactions:      46,914
VII. Data Analysis/Visualization
Workflow
1. Data was extracted for all TaxonIDs with no associationID values from the merged dataset
using Excel.
2. Data was sorted by TaxonIDs with the highest number of missing associations in each location.
3. Tableau was used to produce a geospatial visualization. Different colors in the map represent
different TaxonIDs, with the size of each circle indicating the number of records without
association data for that TaxonID in that location.
A) All values included
B) Cutoff of >50 applied
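Steps 1 and 2 above, performed here in Excel, could equally be scripted. The sketch below uses pandas on an invented toy frame: rows lacking an associationID are filtered out, then counted and ranked by TaxonID and location.

```python
import pandas as pd

# Toy merged dataset: a missing associationID marks an occurrence with no
# recorded interaction.
final = pd.DataFrame({
    "taxonID": ["tax1", "tax1", "tax2", "tax2", "tax2"],
    "decimalLatitude": [39.2, 39.2, -1.3, -1.3, -1.3],
    "decimalLongitude": [-86.5, -86.5, 36.8, 36.8, 36.8],
    "associationID": [None, "preysOn", None, None, None],
})

# Keep only rows with no association, then count per taxon and location.
missing = final[final["associationID"].isna()]
counts = (missing
          .groupby(["taxonID", "decimalLatitude", "decimalLongitude"])
          .size()
          .reset_index(name="missingCount")
          .sort_values("missingCount", ascending=False))
print(counts)
```

The resulting missingCount column is what the circle sizes in the Tableau map encode.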
VIII. Discussion of Key Insights
From the initial analysis of our data and the resulting visualizations it is clear that there is
an incredible lack of understanding of how the vast majority of the organisms on earth interact.
While it is impressive that over a million species interactions have been cataloged in total,
for 70% of these interactions the interaction is the only one recorded for at least one of the
two interacting taxa. Moreover, more than half of all the organisms in the GloBI database have
no recorded interactions at all.
From our initial visualization it appears that more research is being performed on the ecology
of the United States than on other parts of the world. However, the increased number of data
points mapping to the US could also result from greater participation in the GloBI project
among US ecology researchers.
IX. Interim Analysis/Design Issues
Several issues surfaced after discussing our initial visualizations with our client, Mr. Jorrit
Poelen. The main issue with our original visualization was that multiple common names and ID
numbers (depending on which original database the data came from) coexist in the datasets.
Moreover, these discrepancies are not uniform across the original csv files sharing common
variable names. Rama also discovered a related issue: many taxonID values in the taxa.csv file
are not used in occurrence.csv. Mr. Poelen was unaware of this issue and has listed it as a
pending issue to be addressed on the GloBI GitHub (22). For these reasons our original
visualization overestimated the number of taxa with missing interaction data. We are performing
additional data cleaning to resolve these issues prior to creating our final visualizations.
Our original design focused on geospatial visualization of taxa with missing or sparse
interaction data at the species level. Mr. Poelen also suggested that it would be very valuable
to expand our approach to include visualization of missing data at higher phylogenetic ranks,
such as the level of family, order or class. We are currently modifying our datasets and
investigating potential network visualization methods appropriate to these revised goals.
X. Challenges and Opportunities
We have faced many challenges thus far in this project, chief among them the complexity and
non-uniformity of the data, including many variables with combined string and numeric values
and multiple ID values associated with each of the >19 data sources. After several discussions
with the client, due to time constraints and for the sake of simplicity, we have revised our
original plan to focus on just the data from the largest data source: the Integrated Taxonomic
Information System (23).
Our original aim was to create a tool that biologists could utilize to better understand which
organisms and ecosystems are understudied throughout the world. Given the valuable input from
our discussions with our client, Mr. Poelen, we are confident that after several modifications
to our study design we will still be able to produce visualizations that convey this important
information in an informative and compelling manner.
References
1. Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic
Interactions: An open infrastructure to share and analyze species-interaction datasets.
Ecological Informatics. http://dx.doi.org/10.1016/j.ecoinf.2014.08.005
2. https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data
3. http://www.globalbioticinteractions.org/references.html
4. http://blog.globalbioticinteractions.org/
5. http://eol.org/
6. http://gomexsi.tamucc.edu/
7. http://www.globalbioticinteractions.org/about.html
8. Slyusarev, Sergey; Kontopoulos, Dimitrios-Georgios; Taysom, William; Guzman, Adrian;
Wadhwa, Bimlesh (2015): Global Biotic Interactions food web map.
https://figshare.com/articles/Global_Biotic_Interactions_food_web_map/1297762
9. http://danielabar.github.io/globi-proto/#/landing
10. https://figshare.com/articles/GloBI_Explorer_Interactive_Ecosystem_Explorer/1414253/1
11. https://en.wikipedia.org/wiki/Darwin_Core_Archive
12. https://www.w3.org/TeamSubmission/turtle/
13. http://neo4j.com/
14. https://cran.r-project.org/web/packages/rglobi/
15. https://www.npmjs.com/package/globi-data
16. https://github.com/jhpoelen/eol-globi-data/wiki/API
17. https://maven.apache.org/guides/introduction/introduction-to-repositories.html
18. https://cran.r-project.org/web/packages/dplyr/
19. https://cran.r-project.org/web/packages/tidyr/index.html
20. http://www.gbif.org/
21. https://www.idigbio.org/
22. https://github.com/jhpoelen/eol-globi-data/issues/220
23. http://www.itis.gov/