Team: Jim Nelson, Deepak Kher, Rama Raghava Reddy, Al Armstrong
INDIANA UNIVERSITY BLOOMINGTON
Visualizing Missing Species Interactions Data
Client Project – Information Visualization
Visualizing Missing Species Interactions Data
Final Project IU IVMOOC 2016 1
I. Project Title: Visualizing Missing Species Interactions Data
II. Visualization Title: Global Visualization of Missing Species Interaction
III. Team
• Jim Nelson
• Deepak Kher
• Rama Raghava Reddy
• Al Armstrong
IV. Visualization Goals & Importance of Project
Visualization Goals and Prototype
The aim of our project is to use the GloBI APIs to visualize understudied organisms and
locations with minimal interaction data within the GloBI data repository. Snapshots of the
expected visualization product from our project are shown below.
Importance of Project
The human population is continually growing and encroaching upon traditional wildlife habitat.
At the same time, overfishing, pollution and global warming are threatening marine ecosystems.
If worldwide conservation efforts are to succeed it is imperative that we fully understand the
interactions of biological networks across the globe. We hope that our project could become an
important research tool to define knowledge gaps within the hierarchy of interactions among
species worldwide.
V. Related Work
Global Biotic Interactions (GloBI) is an open, interactive and integrated species interaction data
service (1). The goal of GloBI is to provide an infrastructure to catalog all known interactions
among existing species. GloBI provides a means for researchers to combine their biotic
datasets using automated tools that normalize, aggregate and integrate various datasets into
structured repositories (a Neo4j database) using standardized vocabularies and ontologies (2).
Currently, GloBI has cataloged nearly 1.4 million species interactions among 149,676 different
taxa gleaned from over 18,000 studies (3).
As shown in the figure below (4), GloBI is part of a network of related organizations, websites
and other data providers working to catalog and provide access to biological data. Other web
services that integrate directly with the GloBI data include the Encyclopedia of Life (EOL),
sponsored by the Smithsonian Institution, and Gulf of Mexico Species Interactions
(GoMexSI) (5, 6). In addition, a number of published studies have used and cited the GloBI
datasets (7).
For the last two years GloBI has served as one of the IVMOOC client projects. In 2014, the
IVMOOC team created a food-web map by overlaying the GloBI interaction data with terrestrial
and marine ecoregion geospatial data (4, 8). To create the visualization, the team used several
R packages along with Cytoscape and Adobe Illustrator. Last year the IVMOOC team created the
“GloBI Explorer”, an interactive web app geared toward middle and high school students
(4, 9-10). Using the GloBI APIs, species thumbnail photos and simplified network
visualizations, the team created what should be a very effective educational resource to get
students interested in biology and ecology.
VI. Data Statistics
Overview
The data used in this project are available in three formats from the GloBI GitHub repository
(2): Darwin Core (CSV format) (11), Turtle (RDF format) (12) and Neo4j (graph database format)
(13). The data can also be accessed using software libraries (R and JavaScript) (14-15) or by
accessing the API directly (16). The datasets are recreated, normalized, integrated and
exported to the various data archives (a Neo4j graph database, a Darwin Core archive and an
RDF/Turtle archive) using Maven (17), as shown in the diagram below (2).
GloBI data normalization routine
We chose to use the data available as CSV files in the Darwin Core Archive format, a standard
for biodiversity informatics data such as this (11). Six separate CSV files were extracted
from a single tarball downloaded from the GloBI GitHub repository. Here is a summary of the
main variables in each file:
occurrence.csv
• occurrenceID = unique ID for each of the 1.4 million organism interactions
• taxonID = organism ID
• decimalLatitude
• decimalLongitude
association.csv
• occurrenceID = as above
• associationID = type of interaction (e.g., predator/prey, parasitic, pathogenic)
• additional columns that we may or may not need
taxa.csv
• taxonID
• furtherInformationURL = web link to more information
• scientificName = Latin name
reference.csv
• table showing authors and study citations for datasets
measurementOrFact.csv
• table containing data related to different physical measurements obtained for each taxon
taxonCache.csv
• table containing phylogenetic hierarchy data (a.k.a. the tree of life), including scientific
and common names
Data Extraction, Integration and Cleaning
We used a multi-pronged approach to extract, integrate and clean the datasets. Initially the
occurrence, association and taxa datasets were loaded into R and, using the R packages dplyr
(18) and tidyr (19), occurrence.csv, association.csv and taxa.csv were joined on the
occurrenceID and taxonID variables. Many of the Darwin Core variables are non-uniform;
occurrenceID, for example, combines multiple ID formats from the original databases (e.g.,
Encyclopedia of Life (EOL) (5), Global Biodiversity Information Facility (GBIF) (20) and
Integrated Digitized Biocollections (iDigBio) (21)). As a result, using R for data cleaning
became time consuming and problematic. Ultimately these steps were performed with SQL. First
the above CSV files were loaded into a SQL database. The data of interest were then extracted
from the SQL database using the following three custom Python scripts:
1. Taxon and Occurrence.py
For each taxonID in the occurrence file, all of its occurrences were extracted and stored in a
JSON file.
Here the key is the taxonID and the value is a list of [occurrenceID, decimalLatitude, decimalLongitude].
2. Occurrence and Association.py
For each occurrenceID stored in the above JSON file, all relevant data were retrieved from the
association.csv file: association ID, targetOccurrenceID, association type and reference ID.
If association details exist for an occurrence, they are stored; otherwise (the occurrenceID
is missing from association.csv) a null value is stored for each of those columns. These data
were then exported to another JSON file.
Here the key is the occurrenceID and the value is a list of [association ID, targetOccurrenceID,
reference ID, association type].
3. Final.py
The two JSON files created by the above two scripts were merged and the final data were stored
in final.csv.
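The three scripts themselves are not reproduced in this report. The following is a minimal sketch of the same three steps, assuming small in-memory sample rows in place of the SQL database; the function names, the sample values and the exact column names (targetOccurrenceID, referenceID, associationType) are illustrative assumptions, not the team's actual code.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample rows standing in for the occurrence and association tables.
OCCURRENCES = [
    {"occurrenceID": "o1", "taxonID": "t1", "decimalLatitude": "29.3", "decimalLongitude": "-94.8"},
    {"occurrenceID": "o2", "taxonID": "t1", "decimalLatitude": "", "decimalLongitude": ""},
    {"occurrenceID": "o3", "taxonID": "t2", "decimalLatitude": "39.2", "decimalLongitude": "-86.5"},
]
ASSOCIATIONS = [
    {"occurrenceID": "o1", "associationID": "a1", "targetOccurrenceID": "o3",
     "referenceID": "r1", "associationType": "preysOn"},
]


def occurrences_by_taxon(occurrences):
    """Step 1: map taxonID -> list of [occurrenceID, latitude, longitude]."""
    grouped = defaultdict(list)
    for row in occurrences:
        grouped[row["taxonID"]].append(
            [row["occurrenceID"], row["decimalLatitude"], row["decimalLongitude"]]
        )
    return dict(grouped)


def associations_by_occurrence(taxon_index, associations):
    """Step 2: map occurrenceID -> [associationID, target, referenceID, type].

    Simplification: only one association per occurrence is kept.
    Occurrences absent from the association table get four nulls.
    """
    lookup = {a["occurrenceID"]: a for a in associations}
    result = {}
    for records in taxon_index.values():
        for occurrence_id, _lat, _lon in records:
            a = lookup.get(occurrence_id)
            if a is None:
                result[occurrence_id] = [None, None, None, None]
            else:
                result[occurrence_id] = [a["associationID"], a["targetOccurrenceID"],
                                         a["referenceID"], a["associationType"]]
    return result


def merge_to_csv(taxon_index, association_index):
    """Step 3: flatten both indexes into CSV text, one row per occurrence."""
    header = ["taxonID", "occurrenceID", "decimalLatitude", "decimalLongitude",
              "associationID", "targetOccurrenceID", "referenceID", "associationType"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for taxon_id, records in taxon_index.items():
        for occurrence_id, lat, lon in records:
            # csv.writer emits None as an empty field.
            writer.writerow([taxon_id, occurrence_id, lat, lon] + association_index[occurrence_id])
    return buffer.getvalue()


# Round-trip through JSON to mimic the intermediate files described above.
taxon_index = json.loads(json.dumps(occurrences_by_taxon(OCCURRENCES)))
association_index = json.loads(json.dumps(associations_by_occurrence(taxon_index, ASSOCIATIONS)))
final_csv = merge_to_csv(taxon_index, association_index)
print(final_csv)
```

Keeping the null placeholders in the second step is what later allows records without any association to be counted rather than silently dropped.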
Dataset Description
Dataset description table
Total number of occurrences (interactions): 1,048,575
Number of unique occurrences: 700,683
Number of unique taxa (different organisms): 108,345
Number of different taxa accounting for the total number of unique occurrences: 55,741
Percentage of taxa without interaction data: 51%
Number of occurrences representing taxa with multiple interactions: 46,914
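The report does not show how these figures were computed. As a rough illustration only, counts of this kind could be derived from a merged table as sketched below; the sample rows are hypothetical, and treating "taxa without interaction data" as taxa with no non-null associationID is an assumption about the intended definition.

```python
# Hypothetical merged rows: (taxonID, occurrenceID, associationID or None).
merged_rows = [
    ("t1", "o1", "a1"),
    ("t1", "o1", "a2"),
    ("t2", "o2", None),
    ("t3", "o3", "a3"),
    ("t4", "o4", None),
]

total_rows = len(merged_rows)
unique_occurrences = {occ for _taxon, occ, _assoc in merged_rows}
unique_taxa = {taxon for taxon, _occ, _assoc in merged_rows}

# Assumed definition: a taxon has interaction data if any of its rows
# carries a non-null associationID.
taxa_with_interactions = {taxon for taxon, _occ, assoc in merged_rows if assoc is not None}
taxa_without = unique_taxa - taxa_with_interactions
percent_without = 100 * len(taxa_without) / len(unique_taxa)

print(total_rows, len(unique_occurrences), len(unique_taxa), percent_without)
```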
VII. Data Analysis/Visualization
Visualizing Missing Species Interactions Data
Final Project IU IVMOOC 2016 9
Workflow
1. Data were extracted for all taxonIDs with no associationID values from the merged dataset
using Excel.
2. Data were sorted by the taxonIDs with the highest number of records without associations at
each specific location.
3. Tableau was used to produce a geospatial visualization. Different colors on the map represent
different taxonIDs, and the size of each circle indicates the number of records without
associations for that taxonID at that location.
A) All values included
B) Cutoff of >50 applied
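The workflow above was carried out in Excel and Tableau. The filtering, counting and cutoff steps can be sketched in Python as follows; this is a stand-in for the spreadsheet work under stated assumptions, with hypothetical sample rows and a hypothetical `missing_counts` helper.

```python
from collections import Counter

# Hypothetical merged rows: (taxonID, latitude, longitude, associationID or None).
rows = [
    ("t1", 29.3, -94.8, None),
    ("t1", 29.3, -94.8, None),
    ("t1", 29.3, -94.8, None),
    ("t1", 29.3, -94.8, "a1"),
    ("t2", 39.2, -86.5, None),
]


def missing_counts(rows, cutoff=0):
    """Count records without an association per (taxonID, lat, lon), largest first.

    Only groups whose count is strictly greater than `cutoff` are kept.
    """
    counts = Counter(
        (taxon, lat, lon) for taxon, lat, lon, assoc in rows if assoc is None
    )
    kept = [(key, n) for key, n in counts.items() if n > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


all_values = missing_counts(rows)          # analogous to panel A: all values included
filtered = missing_counts(rows, cutoff=2)  # analogous to panel B; the report used >50
print(all_values)
print(filtered)
```

Each resulting count corresponds to one circle on the map, with the count driving the circle size.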
VIII. Discussion of Key Insights
From the initial analysis of our data and the resulting visualizations it is clear that we know
remarkably little about how the vast majority of organisms on Earth interact. While it is
impressive that over a million species interactions have been cataloged in total, for 70% of
these interactions that interaction is the only one recorded for at least one of the two
interacting taxa. Moreover, more than half of all the organisms in the GloBI database have no
data on even a single interaction.
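To make the definition behind the 70% figure concrete, the sketch below computes the share of interactions that are the only recorded interaction for at least one of their two taxa. The interaction pairs are hypothetical toy data, so the resulting percentage is illustrative and unrelated to the figure reported above.

```python
from collections import Counter

# Hypothetical (source taxon, target taxon) interaction pairs.
interactions = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "f")]

# Number of interactions each taxon takes part in; a self-interaction counts once.
per_taxon = Counter()
for source, target in interactions:
    for taxon in {source, target}:
        per_taxon[taxon] += 1

# An interaction qualifies if either participant appears in no other interaction.
sole_record = [
    (source, target)
    for source, target in interactions
    if per_taxon[source] == 1 or per_taxon[target] == 1
]
share = 100 * len(sole_record) / len(interactions)
print(sole_record, share)
```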
From our initial visualization it appears that more research is being performed on the ecology of
the United States than on other parts of the world. However, the greater number of data points
mapping to the US could also result from greater participation in the GloBI project among US
ecology researchers.
IX. Interim Analysis/Design Issues
Several issues surfaced after discussing our initial visualizations with our client, Mr. Jorrit
Poelen. The main issue with our original visualization was that multiple common names and ID
numbers (depending on which original database the data came from) coexist in the datasets.
Moreover, these discrepancies are not uniform across the original CSV files that share common
variable names. Rama also discovered a related issue: many taxonID values in the taxa.csv file
are not used in occurrence.csv. Mr. Poelen was unaware of this issue and has listed it as a
pending issue to be addressed in the GloBI GitHub repository (22). For these reasons our
original visualization overestimated the number of taxa with missing interaction data. We are
performing additional data cleaning to resolve these issues prior to creating our final
visualizations.
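The unused-taxonID issue amounts to a set difference between the two files. A minimal sketch follows, assuming hypothetical ID values; the prefixes merely imitate the mixed ID formats described earlier and are not taken from the actual files.

```python
# Hypothetical taxonID values as they might appear in taxa.csv.
taxa_ids = {"EOL:1", "EOL:2", "GBIF:3", "ITIS:4"}

# Hypothetical taxonID column from occurrence.csv (IDs can repeat).
occurrence_taxon_ids = ["EOL:1", "EOL:1", "ITIS:4"]

# IDs listed in taxa.csv that no occurrence record refers to.
unused = sorted(taxa_ids - set(occurrence_taxon_ids))
print(unused)

# Treating every taxa.csv entry as a candidate inflates the denominator,
# which is one way a count of taxa lacking interaction data can be overestimated.
naive_candidates = len(taxa_ids)
usable_candidates = len(taxa_ids) - len(unused)
print(naive_candidates, usable_candidates)
```

Restricting the analysis to taxonIDs that occur in both files removes these phantom entries before any missing-interaction counts are made.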
Our original design focused on geospatial visualization of taxa with missing or few interaction
data at the species level. Mr. Poelen also suggested that it would be very valuable to expand our
approach to include visualization of missing data at higher taxonomic ranks such as family,
order or class. We are currently modifying our datasets and investigating
potential network visualization methods that may be appropriate to achieve these revised goals.
X. Challenges and Opportunities
We have faced many challenges thus far in this project. Chief among these is the
complexity and non-uniformity of the data, including many variables with combined string and
numeric values and multiple ID values associated with each of the >19 data sources. After
several discussions with the client, due to time constraints and for the sake of simplicity, we
have revised our original plan to focus on just the data from the largest data source, the
Integrated Taxonomic Information System (23).
It was our original aim to create a tool that biologists could use to better understand which
organisms and ecosystems are understudied throughout the world. Given the valuable input from
our discussions with our client Mr. Poelen, we are confident that after several modifications to
our study design we will still be able to produce one or more visualizations that convey this
important information in an informative and compelling manner.
References
1. Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic
Interactions: An open infrastructure to share and analyze species-interaction datasets.
Ecological Informatics. http://dx.doi.org/10.1016/j.ecoinf.2014.08.005
2. https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data
3. http://www.globalbioticinteractions.org/references.html
4. http://blog.globalbioticinteractions.org/
5. http://eol.org/
6. http://gomexsi.tamucc.edu/
7. http://www.globalbioticinteractions.org/about.html
8. Slyusarev, Sergey; Kontopoulos, Dimitrios-Georgios; Taysom, William; Guzman, Adrian;
Wadhwa, Bimlesh (2015): Global Biotic Interactions food web map.
https://figshare.com/articles/Global_Biotic_Interactions_food_web_map/1297762
9. http://danielabar.github.io/globi-proto/#/landing
10. https://figshare.com/articles/GloBI_Explorer_Interactive_Ecosystem_Explorer/1414253/1
11. https://en.wikipedia.org/wiki/Darwin_Core_Archive
12. https://www.w3.org/TeamSubmission/turtle/
13. http://neo4j.com/
14. https://cran.r-project.org/web/packages/rglobi/
15. https://www.npmjs.com/package/globi-data
16. https://github.com/jhpoelen/eol-globi-data/wiki/API
17. https://maven.apache.org/guides/introduction/introduction-to-repositories.html
18. https://cran.r-project.org/web/packages/dplyr/
19. https://cran.r-project.org/web/packages/tidyr/index.html
20. http://www.gbif.org/
21. https://www.idigbio.org/
22. https://github.com/jhpoelen/eol-globi-data/issues/220
23. http://www.itis.gov/
Colleagues

More Related Content

Similar to IU Data Visualization Class Final Project: Visualizing Missing Species Interactions

2015 Summer - Araport Project Overview Leaflet
2015 Summer - Araport Project Overview Leaflet2015 Summer - Araport Project Overview Leaflet
2015 Summer - Araport Project Overview Leaflet
Araport
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
Edward Baker
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
Vince Smith
 
iBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database SearchiBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database Search
The Children's Hospital of Philadelphia
 
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
Dag Endresen
 
Data and science
Data and scienceData and science
Data and science
Anand Deshpande
 
TDWG at the University of Tasmania
TDWG at the University of TasmaniaTDWG at the University of Tasmania
TDWG at the University of Tasmania
leebel
 
World bank 2011-05
World bank 2011-05World bank 2011-05
World bank 2011-05
Johannes Keizer
 
Presentationonline
PresentationonlinePresentationonline
Presentationonline
kashif Iqbal Kashif.Iqbal.Shah
 
Zookeyeditorial
ZookeyeditorialZookeyeditorial
Zookeyeditorial
Vishwas Chavan
 
IBC FAIR Data Prototype Implementation slideshow
IBC FAIR Data Prototype Implementation   slideshowIBC FAIR Data Prototype Implementation   slideshow
IBC FAIR Data Prototype Implementation slideshow
Mark Wilkinson
 
Web services for sharing germplasm data sets, at FAO in Rome (2006)
Web services for sharing germplasm data sets, at FAO in Rome (2006)Web services for sharing germplasm data sets, at FAO in Rome (2006)
Web services for sharing germplasm data sets, at FAO in Rome (2006)
Dag Endresen
 
Penev, L et al. Publ Dissem Data Zookeys 06 01 09
Penev, L et al. Publ Dissem Data Zookeys 06 01 09Penev, L et al. Publ Dissem Data Zookeys 06 01 09
Penev, L et al. Publ Dissem Data Zookeys 06 01 09
Tom Moritz
 
Scratchpads introductory presentation 45mins
Scratchpads introductory presentation   45minsScratchpads introductory presentation   45mins
Scratchpads introductory presentation 45mins
Dimitrios Koureas
 
AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011
Alex Hardisty
 
Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)
Oscar Corcho
 
BiSciCol ievobio
BiSciCol ievobioBiSciCol ievobio
BiSciCol ievobio
John Deck
 
2 Discovery and Acquisition of Data1.pptx
2 Discovery and Acquisition of Data1.pptx2 Discovery and Acquisition of Data1.pptx
2 Discovery and Acquisition of Data1.pptx
vijayapraba1
 
Linked Open Data (LOD) part 1
Linked Open Data (LOD) part 1Linked Open Data (LOD) part 1
Linked Open Data (LOD) part 1
IPLODProject
 
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics caseData integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
IJDKP
 

Similar to IU Data Visualization Class Final Project: Visualizing Missing Species Interactions (20)

2015 Summer - Araport Project Overview Leaflet
2015 Summer - Araport Project Overview Leaflet2015 Summer - Araport Project Overview Leaflet
2015 Summer - Araport Project Overview Leaflet
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
iBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database SearchiBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database Search
 
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March...
 
Data and science
Data and scienceData and science
Data and science
 
TDWG at the University of Tasmania
TDWG at the University of TasmaniaTDWG at the University of Tasmania
TDWG at the University of Tasmania
 
World bank 2011-05
World bank 2011-05World bank 2011-05
World bank 2011-05
 
Presentationonline
PresentationonlinePresentationonline
Presentationonline
 
Zookeyeditorial
ZookeyeditorialZookeyeditorial
Zookeyeditorial
 
IBC FAIR Data Prototype Implementation slideshow
IBC FAIR Data Prototype Implementation   slideshowIBC FAIR Data Prototype Implementation   slideshow
IBC FAIR Data Prototype Implementation slideshow
 
Web services for sharing germplasm data sets, at FAO in Rome (2006)
Web services for sharing germplasm data sets, at FAO in Rome (2006)Web services for sharing germplasm data sets, at FAO in Rome (2006)
Web services for sharing germplasm data sets, at FAO in Rome (2006)
 
Penev, L et al. Publ Dissem Data Zookeys 06 01 09
Penev, L et al. Publ Dissem Data Zookeys 06 01 09Penev, L et al. Publ Dissem Data Zookeys 06 01 09
Penev, L et al. Publ Dissem Data Zookeys 06 01 09
 
Scratchpads introductory presentation 45mins
Scratchpads introductory presentation   45minsScratchpads introductory presentation   45mins
Scratchpads introductory presentation 45mins
 
AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011
 
Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)
 
BiSciCol ievobio
BiSciCol ievobioBiSciCol ievobio
BiSciCol ievobio
 
2 Discovery and Acquisition of Data1.pptx
2 Discovery and Acquisition of Data1.pptx2 Discovery and Acquisition of Data1.pptx
2 Discovery and Acquisition of Data1.pptx
 
Linked Open Data (LOD) part 1
Linked Open Data (LOD) part 1Linked Open Data (LOD) part 1
Linked Open Data (LOD) part 1
 
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics caseData integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
 

More from James Nelson

IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
James Nelson
 
JN resumeDS 050516
JN resumeDS 050516JN resumeDS 050516
JN resumeDS 050516
James Nelson
 
CGFP proposal
CGFP proposal CGFP proposal
CGFP proposal
James Nelson
 
Easl immuno poster
Easl immuno posterEasl immuno poster
Easl immuno poster
James Nelson
 
Pufa protocol
Pufa protocol Pufa protocol
Pufa protocol
James Nelson
 
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
James Nelson
 
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
James Nelson
 
Serum Vitamin D Deficiency is Associated with NASH in Adults
Serum Vitamin D Deficiency is Associated with NASH in AdultsSerum Vitamin D Deficiency is Associated with NASH in Adults
Serum Vitamin D Deficiency is Associated with NASH in Adults
James Nelson
 
Twitter Dataset Analysis and Geocoding
Twitter Dataset Analysis and Geocoding Twitter Dataset Analysis and Geocoding
Twitter Dataset Analysis and Geocoding
James Nelson
 
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
James Nelson
 
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver diseaseSerum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
James Nelson
 
James Nelson CV 33016
James Nelson CV 33016James Nelson CV 33016
James Nelson CV 33016
James Nelson
 

More from James Nelson (12)

IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
 
JN resumeDS 050516
JN resumeDS 050516JN resumeDS 050516
JN resumeDS 050516
 
CGFP proposal
CGFP proposal CGFP proposal
CGFP proposal
 
Easl immuno poster
Easl immuno posterEasl immuno poster
Easl immuno poster
 
Pufa protocol
Pufa protocol Pufa protocol
Pufa protocol
 
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ...
 
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ...
 
Serum Vitamin D Deficiency is Associated with NASH in Adults
Serum Vitamin D Deficiency is Associated with NASH in AdultsSerum Vitamin D Deficiency is Associated with NASH in Adults
Serum Vitamin D Deficiency is Associated with NASH in Adults
 
Twitter Dataset Analysis and Geocoding
Twitter Dataset Analysis and Geocoding Twitter Dataset Analysis and Geocoding
Twitter Dataset Analysis and Geocoding
 
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA...
 
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver diseaseSerum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease
 
James Nelson CV 33016
James Nelson CV 33016James Nelson CV 33016
James Nelson CV 33016
 

Recently uploaded

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
exukyp
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 

Recently uploaded (20)

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
 

IU Data Visualization Class Final Project: Visualizing Missing Species Interactions

  • 1. Team: Jim Nelson, Deepak Kher, Rama Raghava Reddy, Al Armstrong
    Indiana University Bloomington
    Visualizing Missing Species Interactions Data
    Client Project: Information Visualization
  • 2. Visualizing Missing Species Interactions Data, Final Project, IU IVMOOC 2016
    I. Project Title: Visualizing Missing Species Interactions Data
    II. Visualization Title: Global Visualization of Missing Species Interaction
    III. Team
      • Jim Nelson
      • Deepak Kher
      • Rama Raghava Reddy
      • Al Armstrong
    IV. Visualization Goals & Importance of Project
    Visualization Goals and Prototype
    The aim of our project is to use the GloBI APIs to visualize understudied organisms and locations with minimal interaction data within the GloBI data repository. Please see the snapshots of the expected visualization product from our project.
  • 3. [Figure: snapshots of the expected visualization product]
  • 4. Importance of Project
    The human population is continually growing and encroaching upon traditional wildlife habitat. At the same time, overfishing, pollution and global warming are threatening marine ecosystems. If worldwide conservation efforts are to succeed, it is imperative that we fully understand the interactions of biological networks across the globe. We hope that our project could become an important research tool to define knowledge gaps within the hierarchy of interactions among species worldwide.
    V. Related Work
    Global Biotic Interactions (GloBI) is an open, interactive and integrated species interaction data service (1). The goal of GloBI is to provide an infrastructure to catalog all known interactions among existing species. GloBI provides a means for researchers to combine their biotic datasets using automated tools that normalize, aggregate and integrate various datasets into structured repositories (a Neo4j database) using standardized vocabularies and ontologies (2). Currently, GloBI has cataloged nearly 1.4 million species interactions among 149,676 different taxa gleaned from over 18,000 studies (3). As shown in the figure below (4), GloBI is part of a network of related organizations, websites and other data providers working to catalog and provide access to biological data. Other web services that integrate directly with the GloBI data include the Encyclopedia of Life (EOL), sponsored by the Smithsonian Institution, and the Gulf of Mexico Species Interactions (GoMexSI) (5, 6). In addition, a number of published studies have used and cited the GloBI datasets (7).
  • 5. For the last two years GloBI has served as one of the IVMOOC client projects. In 2014, the IVMOOC team created a food-web map by overlaying the GloBI interaction data with terrestrial and marine ecoregion geospatial data (4, 8). To create the visualization, the team used several R packages along with Cytoscape and Adobe Illustrator. Last year the IVMOOC team created the “GloBI Explorer”, a very nice interactive web app geared toward middle and high school students (4, 9-10). Using the GloBI APIs, species thumbnail photos and simplified network visualizations, the team was able to create what should be a very effective educational resource to get students interested in biology and ecology.
    VI. Data Statistics Overview
    The data to be used in this project was available in three formats on the GloBI GitHub repository (2): Darwin Core (csv format) (11), Turtle (rdf format) (12) and Neo4j (graphdb format) (13). The data can also be accessed using software libraries (R and JavaScript) (14-15) or by accessing the API directly (16). The datasets are recreated, normalized, integrated and exported to the various data archives, such as a Neo4j graph database, a Darwin Core archive and an RDF/Turtle archive, using Maven (17) as shown in the diagram below (2).
  • 6. Figure: GloBI data normalization routine
    We chose to use the data available as csv files in the Darwin Core Archive format, which is the standard for biodiversity informatics data such as this (11). Six separate csv files were downloaded and extracted from a single tarball file from the GloBI GitHub repository. Here is a summary of the main variables in each file:
    occurrence.csv
      • occurrenceID = unique ID for each of the 1.4 million organism interactions
      • taxonID = organism ID
      • decimalLatitude
      • decimalLongitude
    association.csv
      • occurrenceID = as above
      • associationID = type of interaction (e.g., predator/prey, parasitic, pathogenic)
      • other variables that we may or may not need
    taxa.csv
      • taxonID
      • furtherInformationURL = web link to more info
      • scientificName = Latin name
    reference.csv
      • table showing authors and study citations for datasets
    measurementOrFact.csv
  • 7.
      • table containing data related to different physical measurements obtained for each taxon
    taxonCache.csv
      • table containing phylogenetic hierarchy data (a.k.a. the tree of life), including scientific and common names
    Data Extraction, Integration and Cleaning
    We used a multi-pronged approach to extract, integrate and clean the datasets. Initially, the occurrence, association and taxa csv datasets were loaded into R and, using the R packages dplyr (18) and tidyr (19), occurrence.csv, association.csv and taxa.csv were joined on the occurrenceID and taxonID variables. Many of the Darwin Core variables are non-uniform; occurrenceID, for example, combines multiple ID formats from the original databases (e.g., Encyclopedia of Life (EOL) (5), Global Biodiversity Information Facility (GBIF) (20) and Integrated Digitized Biocollections (iDigBio) (21)). As a result, using R for data cleaning became time consuming and problematic. Ultimately these steps were performed in SQL. First, the above csv files were loaded into SQL. The data of interest was then extracted from the SQL database using the following three custom Python scripts:
    1. Taxon and Occurrence.py
    For each taxonID in the occurrence file, all the occurrences were extracted and stored in a JSON file. Here the key is the taxonID and the value is a list of [occurrenceID, decimalLatitude, decimalLongitude].
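The grouping step described for the first script can be sketched roughly as follows. This is only an illustration under stated assumptions, not the team's actual script: it reads rows from an in-memory CSV instead of the SQL database, the function name `group_occurrences_by_taxon` is hypothetical, and the sample IDs and coordinates are made up.

```python
import csv
import io
import json
from collections import defaultdict


def group_occurrences_by_taxon(rows):
    """Map each taxonID to a list of [occurrenceID, latitude, longitude]."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["taxonID"]].append(
            [row["occurrenceID"], row["decimalLatitude"], row["decimalLongitude"]]
        )
    return dict(grouped)


# Made-up sample standing in for occurrence.csv (all IDs are hypothetical).
SAMPLE_OCCURRENCES = """occurrenceID,taxonID,decimalLatitude,decimalLongitude
occ1,taxA,27.8,-97.4
occ2,taxA,29.1,-94.8
occ3,taxB,,
"""

taxon_to_occurrences = group_occurrences_by_taxon(
    csv.DictReader(io.StringIO(SAMPLE_OCCURRENCES))
)

# Serialize with taxonID as the key, as the script description states.
taxon_occurrence_json = json.dumps(taxon_to_occurrences, indent=2)
```

Values stay as strings here because `csv.DictReader` does not convert types; an occurrence without coordinates keeps empty strings rather than being dropped.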
  • 8. 2. Occurrence and Association.py
    For each occurrenceID stored in the above JSON file, all relevant data from the association.csv file was retrieved: association ID, target occurrence ID, association type and reference ID. If association details exist for that occurrence, they are stored; if the occurrenceID is missing from association.csv, a null value is stored instead. These data were then exported to another JSON file. Here the key is the occurrenceID and the value is a list of [associationID, targetOccurrenceID, referenceID, association type].
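A minimal sketch of the lookup described for the second script is shown below. The column names `targetOccurrenceID`, `referenceID` and `associationType`, the function name `attach_associations` and the sample rows are assumptions made for illustration; the real association.csv headers may differ.

```python
import json


def attach_associations(occurrence_ids, association_rows):
    """Return {occurrenceID: list of association details, or None if absent}."""
    by_occurrence = {}
    for row in association_rows:
        by_occurrence.setdefault(row["occurrenceID"], []).append(
            [
                row["associationID"],
                row["targetOccurrenceID"],
                row["referenceID"],
                row["associationType"],
            ]
        )
    # Occurrences with no matching association row map to None (JSON null).
    return {occ_id: by_occurrence.get(occ_id) for occ_id in occurrence_ids}


# Made-up sample data (all IDs and the interaction type are hypothetical).
sample_associations = [
    {
        "occurrenceID": "occ1",
        "associationID": "assoc1",
        "targetOccurrenceID": "occ2",
        "referenceID": "ref1",
        "associationType": "preysOn",
    }
]

occurrence_to_association = attach_associations(["occ1", "occ3"], sample_associations)
occurrence_association_json = json.dumps(occurrence_to_association)
```

Storing an explicit null instead of omitting the key keeps every occurrence visible in the output, which is what later makes the missing interactions countable.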
  • 9. 3. Final.py
    The two JSON files created by the above two scripts were merged and the final data was stored in final.csv.
    Dataset Description
    Dataset description table
      Total number of occurrences (interactions)                                        1,048,575
      Number of unique occurrences                                                      700,683
      Number of unique taxa (different organisms)                                       108,345
      Number of different taxa accounting for the total number of unique occurrences    55,741
      Percentage of taxa without interaction data                                       51%
      Number of occurrences representing taxa with multiple interactions                46,914
    VII. Data Analysis/Visualization
  • 10. Workflow
    1. Data was extracted from the merged datasets for all taxonIDs with no associationID values, using Excel.
    2. Data was sorted by the taxonIDs with the highest number of records lacking an association at a specific location.
    3. Tableau was used to produce a geospatial visualization. Different colors in the map represent different taxonIDs, and the size of each circle indicates the number of records without an association for that taxonID at that location.
    A) All values included
    B) Cutoff of >50 applied
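Steps 1 and 2 of the workflow were done in Excel; the same counting and ranking could be sketched in Python as below. This is a hedged illustration only: the record layout, the function name `count_missing_associations` and the sample values are assumptions, and the cutoff is scaled down to suit the tiny sample.

```python
from collections import Counter


def count_missing_associations(records, cutoff=0):
    """Count records lacking an associationID per (taxonID, latitude, longitude).

    Returns (key, count) pairs sorted from highest to lowest count,
    keeping only pairs whose count is strictly greater than the cutoff.
    """
    counts = Counter(
        (rec["taxonID"], rec["decimalLatitude"], rec["decimalLongitude"])
        for rec in records
        if not rec.get("associationID")  # None or empty string means no association
    )
    return [(key, n) for key, n in counts.most_common() if n > cutoff]


# Made-up merged records (all IDs and coordinates are hypothetical).
SAMPLE_RECORDS = [
    {"taxonID": "taxA", "decimalLatitude": "27.8", "decimalLongitude": "-97.4", "associationID": None},
    {"taxonID": "taxA", "decimalLatitude": "27.8", "decimalLongitude": "-97.4", "associationID": None},
    {"taxonID": "taxA", "decimalLatitude": "27.8", "decimalLongitude": "-97.4", "associationID": ""},
    {"taxonID": "taxB", "decimalLatitude": "40.0", "decimalLongitude": "-86.5", "associationID": None},
    {"taxonID": "taxC", "decimalLatitude": "40.0", "decimalLongitude": "-86.5", "associationID": "assoc9"},
]

all_values = count_missing_associations(SAMPLE_RECORDS)            # like panel A
with_cutoff = count_missing_associations(SAMPLE_RECORDS, cutoff=2)  # like panel B, smaller cutoff
```

Each returned count corresponds to what the circle size encodes in the map; the report's panel B would use `cutoff=50`.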
  • 11. VIII. Discussion of Key Insights
    From the initial analysis of our data and the resulting visualizations, it is clear that there is an incredible lack of understanding of how the vast majority of the organisms on earth interact. While it is impressive that over a million species interactions have been cataloged in total, 70% of these are the only recorded interaction for at least one of the two interacting taxa. Moreover, more than half of all the organisms in the GloBI database have no data on even a single interaction. From our initial visualization it appears that more research is being performed on the ecology of the United States than on other parts of the world. However, the greater number of data points mapping to the US could also result from greater participation in the GloBI project among US ecology researchers.
    IX. Interim Analysis/Design Issues
    Several issues surfaced after discussing our initial visualizations with our client, Mr. Jorrit Poelen. The main issue with our original visualization was that multiple common names and ID numbers (depending on which original database the data came from) coexist in the datasets. Moreover, these discrepancies are not uniform across the original csv files sharing common variable names. Rama also discovered a related issue: many taxonID values in the taxa.csv file are not used in occurrence.csv. Mr. Poelen was unaware of this issue and has listed it as a pending issue to be addressed in the GloBI GitHub repository (22). For these reasons our original visualization overestimated the number of taxa with missing interaction data. We are performing additional data cleaning to resolve these issues prior to creating our final visualizations. Our original design focused on geospatial visualization of taxa with missing or few interaction data at the species level. Mr. Poelen also suggested that it would be very valuable to expand our approach to include visualization of missing data at higher phylogenetic ranks, such as family, order or class. We are currently modifying our datasets and investigating potential network visualization methods that may be appropriate to achieve these revised goals.
    X. Challenges and Opportunities
    We have faced many challenges thus far in this project, chief among them the complexity and non-uniformity of the data, including many variables with combined string and numeric values and multiple ID values associated with each of the >19 data sources. After several discussions with the client, due to time constraints and for the sake of simplicity, we have revised our original plan to focus on just the data from the largest data source: the Integrated Taxonomic Information System (23). It was our original aim to create a tool that biologists could use to better understand which organisms and ecosystems are understudied throughout the world. Given the valuable input from our discussions with our client, Mr. Poelen, we are confident that after several modifications to our study design we will still be able to produce one or more visualizations that convey this important information in an informative and compelling manner.
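The roll-up to higher phylogenetic ranks suggested in section IX could be prototyped along these lines. This is a sketch under assumptions, not part of the project's pipeline: it presumes a lookup from taxonID to rank names (such as one derived from taxonCache.csv) is already available, and the function name `missing_share_by_rank`, the dictionary layout and the sample taxa are all hypothetical.

```python
from collections import defaultdict


def missing_share_by_rank(taxon_ranks, taxa_with_interactions, rank="family"):
    """Return, per group at the given rank, the share of taxa with no interaction."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for taxon_id, ranks in taxon_ranks.items():
        group = ranks.get(rank, "unknown")
        totals[group] += 1
        if taxon_id not in taxa_with_interactions:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Made-up lookup from taxonID to higher ranks (IDs are hypothetical).
SAMPLE_RANKS = {
    "taxA": {"family": "Felidae", "order": "Carnivora"},
    "taxB": {"family": "Felidae", "order": "Carnivora"},
    "taxC": {"family": "Canidae", "order": "Carnivora"},
}

share_by_family = missing_share_by_rank(SAMPLE_RANKS, {"taxA"}, rank="family")
share_by_order = missing_share_by_rank(SAMPLE_RANKS, {"taxA"}, rank="order")
```

Switching the `rank` argument changes the aggregation level without touching the underlying records, so the same table could feed a family-level or order-level view.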
  • 12. References
    1. Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. http://dx.doi.org/10.1016/j.ecoinf.2014.08.005
    2. https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data
    3. http://www.globalbioticinteractions.org/references.html
    4. http://blog.globalbioticinteractions.org/
    5. http://eol.org/
    6. http://gomexsi.tamucc.edu/
    7. http://www.globalbioticinteractions.org/about.html
    8. Slyusarev, Sergey; Kontopoulos, Dimitrios-Georgios; Taysom, William; Guzman, Adrian; Wadhwa, Bimlesh (2015): Global Biotic Interactions food web map. https://figshare.com/articles/Global_Biotic_Interactions_food_web_map/1297762
    9. http://danielabar.github.io/globi-proto/#/landing
    10. https://figshare.com/articles/GloBI_Explorer_Interactive_Ecosystem_Explorer/1414253/1
    11. https://en.wikipedia.org/wiki/Darwin_Core_Archive
    12. https://www.w3.org/TeamSubmission/turtle/
    13. http://neo4j.com/
    14. https://cran.r-project.org/web/packages/rglobi/
    15. https://www.npmjs.com/package/globi-data
    16. https://github.com/jhpoelen/eol-globi-data/wiki/API
    17. https://maven.apache.org/guides/introduction/introduction-to-repositories.html
    18. https://cran.r-project.org/web/packages/dplyr/
    19. https://cran.r-project.org/web/packages/tidyr/index.html
    20. http://www.gbif.org/
    21. https://www.idigbio.org/
    22. https://github.com/jhpoelen/eol-globi-data/issues/220
    23. http://www.itis.gov/