Our project also aims to use the GloBI APIs to visualize understudied organisms and locations with minimal interaction data in the GloBI data repository.
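As a rough sketch of how such a visualization pipeline might start, the snippet below builds a GloBI interaction query URL and counts records in a GloBI-style tabular JSON response. The endpoint path and parameter names follow GloBI's public `/interaction` route but should be treated as assumptions, and the sample response is canned rather than fetched live.

```python
import json
from urllib.parse import urlencode

# Base URL of the public GloBI web API; the endpoint and parameter names
# below are assumptions modeled on GloBI's documented /interaction route.
GLOBI_API = "https://api.globalbioticinteractions.org/interaction"

def build_interaction_query(source_taxon, interaction_type="interactsWith"):
    """Build a GloBI interaction query URL for one source taxon."""
    params = {"sourceTaxon": source_taxon, "interactionType": interaction_type}
    return GLOBI_API + "?" + urlencode(params)

def count_interactions(response_text):
    """Count interaction records in a GloBI-style JSON response.

    GloBI returns tabular JSON with 'columns' and 'data' keys; the sample
    below mimics that shape rather than a live response.
    """
    payload = json.loads(response_text)
    return len(payload.get("data", []))

# Canned example response (invented records, GloBI-like shape).
sample = ('{"columns": ["source", "interaction", "target"], '
          '"data": [["Enhydra lutris", "eats", "Strongylocentrotus"], '
          '["Enhydra lutris", "eats", "Haliotis"]]}')
url = build_interaction_query("Enhydra lutris", "eats")
```

A taxon with few rows in such a response is exactly the kind of understudied organism the project aims to surface.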
1. The document discusses issues with agricultural information systems like different user needs, multiple data sources, and lack of interoperability.
2. It proposes using shared vocabularies, ontologies, and application profiles like AGRIS AP and AgMES to enable semantic interoperability across systems through a common exchange layer.
3. The Agricultural Ontology Service aims to improve semantic search and access to agricultural knowledge resources by providing a registry and federated storage for vocabularies, ontologies, and other knowledge organization systems like AGROVOC.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
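The convergence property at the heart of the CRDT approach can be illustrated with a toy state-based grow-only set (G-Set): replicas apply adds locally and merge by set union, so merges commute and every replica reaches the same state regardless of delivery order. This is a minimal illustration of the CRDT idea, not the specific data type the talk proposes for semantic stores.

```python
# Toy grow-only set (G-Set) CRDT sketch: merge is set union, which is
# commutative, associative, and idempotent, so replicas converge no
# matter in which order they exchange states.

class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, value):          # local update at one replica
        self.elements.add(value)

    def merge(self, other):        # state-based merge via union
        merged = GSet()
        merged.elements = self.elements | other.elements
        return merged

# Two replicas diverge, then exchange states in either order.
a, b = GSet(), GSet()
a.add("triple-1")
b.add("triple-2")
ab = a.merge(b)
ba = b.merge(a)
```

Because `ab` and `ba` are identical, optimistic replication never needs coordination to converge, which is what makes a writable, massively replicated linked-data cloud plausible.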
BHL is a digital library that provides open access to biodiversity literature. It contains over 33,000 volumes and 13.3 million pages that are digitized from partner institutions. Usage is growing, with 45,000 unique users and 250,000 page views per month. BHL faces challenges in managing the distributed digitized content from partners and improving technologies like OCR and name recognition. It provides open APIs and data to enable discovery and sharing of content.
This document discusses Neo4j and its applications in bioinformatics. It describes Bio4j, an open source bioinformatics graph database built using Neo4j that integrates data from sources like Uniprot, NCBI taxonomy, Gene Ontology, and more. Bio4j models biological data as nodes and relationships in a graph structure rather than tables. This allows for more flexible querying and knowledge integration. The document provides examples of how Bio4j can be accessed through its Java API, Cypher query language, Gremlin traversal language, and REST API. It also describes some tools and visualizations for exploring and analyzing Bio4j data.
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an... (NextMove Software)
The document summarizes a presentation about using large graph databases for chemical similarity searching. It describes building a graph database of 68 billion molecular substructures from 340 million molecules and using graph edit distance to perform sublinear-scaling searches through the database to identify similar molecules. This approach scales better to large datasets than traditional fingerprint-based similarity methods.
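For contrast with the graph approach, the traditional fingerprint baseline mentioned above can be sketched in a few lines: each molecule becomes a set of substructure keys and pairs are compared with the Tanimoto (Jaccard) coefficient. The fragment names here are invented purely for illustration.

```python
# Fingerprint-style similarity baseline: molecules as sets of
# substructure keys, compared with the Tanimoto coefficient.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two substructure-key sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Made-up substructure keys for two related molecules.
aspirin = {"benzene", "ester", "carboxylic_acid"}
salicylic_acid = {"benzene", "hydroxyl", "carboxylic_acid"}
score = tanimoto(aspirin, salicylic_acid)
```

A linear scan with this function touches every molecule, which is the scaling limitation the substructure-graph index is designed to avoid.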
Learning Multilingual Semantics from Big Data on the Web (Gerard de Melo)
This document summarizes Gerard de Melo's presentation on learning multilingual semantics from big data on the web. It discusses how lexical and taxonomic knowledge can be extracted at large scale from online resources like Wiktionary, Wikipedia, and WordNet. Methods are presented for merging structured data like knowledge graphs and integrating taxonomies across languages using techniques like linear program relaxation and belief propagation. The goal is to build large yet reasonably clean multilingual knowledge bases to power applications in areas like semantic search and the digital humanities.
The document summarizes an open genomic data project called OpenFlyData that links and integrates gene expression data from multiple sources using semantic web technologies. It describes how RDF and SPARQL are used to query linked data from sources like FlyBase, BDGP and FlyTED. It also discusses applications built on top of the linked data as well as performance and challenges of the system.
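The kind of triple-pattern query a SPARQL engine runs over such linked data can be mimicked with a tiny in-memory matcher. The triples and predicate names below are invented for illustration; OpenFlyData itself queried real SPARQL endpoints over FlyBase, BDGP and FlyTED data.

```python
# Minimal in-memory stand-in for SPARQL-style triple-pattern matching.
# Identifiers are hypothetical, loosely styled after FlyBase/BDGP IRIs.

triples = [
    ("flybase:FBgn0000490", "rdfs:label", "dpp"),
    ("flybase:FBgn0000490", "ex:expressedIn", "bdgp:embryo_stage_9"),
    ("flybase:FBgn0003731", "rdfs:label", "egfr"),
]

def match(pattern, store):
    """Return triples matching a pattern; None acts as a SPARQL variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?s ?o WHERE { ?s rdfs:label ?o }
labels = match((None, "rdfs:label", None), triples)
```

Federating the same pattern over several stores is, in essence, what linking FlyBase, BDGP and FlyTED through SPARQL achieves.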
Araport is an online resource for Arabidopsis and plant research that integrates various types of data from different sources. It provides genome annotation for Arabidopsis that has been validated and updated using RNA-seq data. Data is stored and can be accessed through the ThaleMine data warehouse. Araport also features a JBrowse genome viewer and Science Apps that retrieve real-time data through web services. It is an open source project that welcomes community contributions and holds workshops to support developers.
NHM Data Portal: first steps toward the Graph-of-Life (Edward Baker)
This document summarizes a presentation about the Natural History Museum's (NHM) efforts to create a centralized data portal and move towards a "Graph of Life" by connecting their collected data. It describes the large number and variety of objects in the NHM collections, efforts to digitize specimens, and challenges with previous disconnected digital access systems. The new NHM Data Portal aims to make data discovery and access easier through an open-source CKAN platform, providing over 3.7 million records and APIs. It discusses using the portal and linked open data approaches to ask new questions across datasets, provide metrics on data quality and use, and integrate with external aggregators.
The document proposes a solution called iBioSearch that aims to provide a unified search interface for biologists to search over 1000 biological databases. It does this by first collecting interfaces from various biological databases and then reverse engineering them to generate a global schema (metamodel) that represents common search capabilities across interfaces. It maps each interface as an instance of this metamodel to extract search entities and criteria. It then clusters entities and consolidates criteria to generate a non-redundant global biological search interface (GBWS) for biologists. Future work involves testing this approach with biologists and expanding the methodology.
GBIF registry (GBRDS), at European Nodes meeting in Alicante, Spain (10 March... (Dag Endresen)
Regional NODES meeting of Europe 2010. Presentation of the Global Biodiversity Resources Discovery System (GBRDS, under development) for the NODES. How do we, the NODES, want the GBRDS to look? What do we, the NODES, wish or need the GBRDS to be?
http://www.gbif.org/
http://gbrds.gbif.org/
http://code.google.com/p/gbif-registry/
The document discusses the CIARD (Coherence in Information for Agricultural Research for Development) initiative and how it aims to create a global infrastructure for linked open data. It describes how FAO has worked for decades to make agricultural information more accessible, including through programs like AGRIS and AIMS. The CIARD initiative now involves over 100 partners working to coordinate their efforts and promote common data formats and systems. It outlines FAO's work on vocabularies like AGROVOC and how linked open data can help link distributed data sources in agriculture through applying standards.
This document discusses developing an ontology-based semantic web application for the biological domain. It introduces the need for semantic technologies to help machines better understand and combine biological information from different sources. The document outlines the methodology, which involves defining concepts, properties, and relations in the biological domain to create an ontology. It also discusses implementing a semantic web application using the Jena framework to retrieve and manipulate biological data modeled with ontologies and RDF. The goal is to build a semantic search framework to improve information retrieval for biologists.
Publication and dissemination of datasets in taxonomy: ZooKeys working example
Lyubomir Penev, Terry Erwin, Jeremy Miller, Vishwas Chavan, Tom Moritz, Charles Griswold. ZooKeys 11: 1-8 (2009)
doi: 10.3897/zookeys.11.210
IBC FAIR Data Prototype Implementation slideshow (Mark Wilkinson)
Discussion about ways of achieving FAIRness of both metadata and data. Brute force approaches, and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Penev, L et al. Publ Dissem Data Zookeys 06 01 09 (Tom Moritz)
This document describes a concept for publishing and disseminating datasets in taxonomy that was applied in a paper by Miller et al. published in ZooKeys. Key aspects of the concept include: (1) publishing primary biodiversity data from the paper as a separate dataset with a DOI, (2) making the occurrence dataset available through GBIF simultaneously with publication, and (3) publishing the occurrence dataset as an interactive KML file in Google Earth with a separate DOI. This allows for indexing, aggregation, and reuse of the published data.
Scratchpads are virtual research environments that allow taxonomic and biodiversity data to be collected, curated, analyzed, published, and shared in a digital, open, and linked manner. They provide a seamless workflow for data by hosting websites for communities to enter and structure data using standardized modules. This facilitates dissemination of research through open access publishing of datasets, descriptions, keys, and more without reformatting. Major projects like e-Monocot demonstrate Scratchpads' ability to aggregate data from various sources into an integrated portal.
Text (a personal-views position statement) to accompany a presentation on what research infrastructures really need for data. XLDB-Europe, 8-10 June 2011, Edinburgh.
2 Discovery and Acquisition of Data1.pptx (vijayapraba1)
This document provides an outline of Lecture 2 from the course GEO 802, Data Information Literacy. It discusses various portals and repositories for publishing and finding data, including discipline-specific repositories, as well as directories and indexes of repositories. It also covers data journals and venues for publishing datasets to get them cited. Finally, it lists some exercises for students to find relevant data repositories in their fields and to explore search tools and open data portals.
The IP LodB project (for more details see iplod.io) capitalizes on LOD database thinking to build bridges between patented information and scientific knowledge, focusing on the individuals who codify new knowledge and their connected organizations, including those who apply patents in new products and services.
As its main outputs, IP LodB produced an intellectual property rights (IPR) linked open data (LOD) map (the IP LOD map) and tested the linkability of the European patent (EP) LOD database, while increasing the uniqueness of the data using different harmonization techniques.
These slides were developed for a NIPO workshop.
Data integration in a Hadoop-based data lake: A bioinformatics case (IJDKP)
When working in a data lake, data integration is not easy, mainly because the data is usually stored in raw format. Performing data integration manually is a time-consuming task that requires the supervision of a specialist, who can make mistakes or fail to see the optimal integration point among two or more datasets. This paper presents a model for heterogeneous in-memory data integration in a Hadoop-based data lake using a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration points. The integration algorithm is based on the Overlap coefficient, since it presented better results than the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We tested the model by applying it to eight bioinformatics-domain datasets. The model produces better results than a specialist's analysis, and we expect it can be reused for datasets from other domains.
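The four set-similarity measures the paper compares are easy to state side by side. The snippet below implements each one and uses invented column-value sets to show why the Overlap coefficient stands out: it scores a perfect 1.0 when one column's values are a subset of another's, a common pattern between join-key columns.

```python
# The set-similarity measures compared in the paper; the Overlap
# coefficient is the one its integration algorithm settled on.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def tversky(a, b, alpha=0.5, beta=0.5):
    i = len(a & b)
    return i / (i + alpha * len(a - b) + beta * len(b - a)) if a or b else 1.0

# Invented column values: sample_ids is a subset of gene_ids, so Overlap
# flags them as a perfect integration point while Jaccard/Dice do not.
gene_ids = {"g1", "g2", "g3", "g4"}
sample_ids = {"g1", "g2"}
```

With alpha = beta = 0.5, Tversky reduces to the Sørensen-Dice coefficient, which is why the paper treats them as distinct parameterizations rather than unrelated metrics.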
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi... (James Nelson)
The document describes a machine learning project that compares the performance of R packages for logistic regression and random forest algorithms on wine quality datasets. It loads and prepares the datasets, then explores the data through descriptive statistics. Logistic regression and random forest models are applied to the training data and evaluated on test data.
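The project itself used R packages, but the train-then-evaluate loop it describes can be sketched in pure Python with a minimal gradient-descent logistic regression. The single feature and binary labels below are made up for illustration and are not the actual wine-quality data.

```python
import math

# Minimal logistic-regression sketch of a train/evaluate loop, using
# gradient descent on a toy, invented single-feature dataset.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=1.0, steps=2000):
    """Fit weight and bias by stochastic gradient descent on log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x   # gradient of log-loss w.r.t. w
            b -= lr * (p - y)       # gradient of log-loss w.r.t. b
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) > 0.5 else 0

xs = [0.0, 0.2, 0.8, 1.0]   # stand-in scaled feature (e.g. alcohol level)
ys = [0, 0, 1, 1]           # stand-in binary quality label
w, b = fit(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(ys)
```

The R workflow in the project adds the pieces omitted here: a held-out test split, a random-forest comparison model, and proper metrics beyond raw accuracy.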
James Nelson has over 15 years of experience designing and leading laboratory, translational, and clinical research studies. He has a PhD in Molecular Biology and Genetics from Wayne State University and is currently enrolled in Indiana University's Data Science Master's Program. Nelson has extensive experience in areas such as statistical analysis, machine learning, big data technologies, bioinformatics, and data visualization. He has authored over 70 peer-reviewed publications and has been the recipient of over $7 million in NIH research grants.
Similar to IU Data Visualization Class Final Project: Visualizing Missing Species Interactions
This proposal outlines the commercialization pathway for an investigational in vitro diagnostic (IVD) device for nonalcoholic fatty liver disease (NAFLD). The investigators were unable to identify a substantially equivalent predicate device, so they plan to submit a formal pre-submission to the FDA to obtain guidance on the appropriate regulatory pathway. The studies funded by this proposal would support the information needed for the pre-submission, including analytical validation and performance characteristics of the test. Depending on FDA feedback, the pathway may involve de novo classification, reclassification, or premarket approval.
1) A study examined the effects of a high-fat diet and parenteral iron administration on non-alcoholic fatty liver disease (NAFLD) in an obese, diabetic mouse model. 2) Mice fed a high-fat diet and administered parenteral iron showed increased liver inflammation, oxidative stress, and collagen production compared to mice on only a high-fat diet or normal diet. 3) However, mice given both a high-fat diet and parenteral iron showed less fat accumulation in the liver (steatosis) than mice on only a high-fat diet.
This document summarizes a study that will compare the effects of omega-3 polyunsaturated fatty acid supplementation to monounsaturated fatty acid supplementation for 8 weeks on nonalcoholic fatty liver disease (NAFLD). It will randomize 30 patients with NAFLD and at least 20% steatosis into the two treatment groups. The primary outcome is reduction of intrahepatic fat content as measured by magnetic resonance spectroscopy. Secondary outcomes include changes in liver enzymes, lipid profile, inflammation markers, and insulin resistance. The study personnel, design, population, visit schedule, and treatment protocols are outlined.
A Randomized, Masked, Controlled Study of Omega-3 Polyunsaturated Fatty Acid ... (James Nelson)
The aim of this study is to investigate the effects of an 8-week dietary supplementation with omega-3 polyunsaturated fatty acids (PUFA; i.e., fish oil) compared to monounsaturated fatty acids (MUFA; i.e., safflower oil) on intrahepatic fat content measured by magnetic resonance spectroscopy, serum aminotransferases, fasting lipids, insulin resistance, resting metabolic rate and proinflammatory cytokines in patients with non-alcoholic fatty liver disease.
Variants In The Il6 And Il1β Genes Either Alone Or In Combination With C282Y ... (James Nelson)
The goal of this study was to investigate if IL6 and IL1β cytokine SNPs, alone or in combination with HFE gene mutations, can affect the grade and pattern of hepatic iron deposition and serum iron markers in the well characterized NASH CRN cohort.
Serum Vitamin D Deficiency is Associated with NASH in Adults (James Nelson)
The aim of this study was to determine the relationship of serum vitamin D levels to histologic features of NAFLD, and associated demographic, clinical, and laboratory data in the well characterized NASH CRN cohort.
Twitter Dataset Analysis and Geocoding (James Nelson)
The aim of the project was to validate user-defined location data in a Twitter dataset of 10,000 tweets using MongoDB and the Google Maps Geocoding API.
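The geocoding request sent for each user-defined location field can be sketched without touching the network by building the Google Maps Geocoding API URL directly. The endpoint is the documented public one; `YOUR_API_KEY` is a placeholder, and no request is actually issued here.

```python
from urllib.parse import urlencode

# Build the Google Maps Geocoding API request URL for one free-text
# location string from a tweet. "YOUR_API_KEY" is a placeholder.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(location, api_key="YOUR_API_KEY"):
    """Return the geocoding request URL for a user-defined location field."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": location,
                                               "key": api_key})

url = geocode_url("Bloomington, IN")
```

In the project, the JSON response for each such URL would be compared against the tweet's stated location and stored in MongoDB for validation.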
Deep Sequencing Identifies Novel Circulating and Hepatic ncRNA Profiles in NA... (James Nelson)
Next-generation RNA sequencing has expedited the identification of new non-coding RNA species (ncRNAs), thus ushering in the emerging field of ncRNA biology. The goals of this study were to catalogue the spectrum of different ncRNAs in serum and liver of patients with NAFLD and to compare expression of serum exRNAs between NAFLD patients and healthy control subjects.
Serum microRNA biomarkers for prognosis of nonalcoholic fatty liver disease (James Nelson)
Next-generation sequencing (NGS) was performed on 45 serum RNA samples using the Illumina HiScanSQ platform. The goal of this study was to determine serum miRNA profiles for use as novel diagnostic and prognostic biomarkers for the presence of NAFLD, NASH and advanced fibrosis.
This curriculum vitae summarizes the education and experience of James E. Nelson. He received a PhD in Molecular Biology and Genetics from Wayne State University in 1994. Since then, he has held several research and staff positions, primarily focused on nonalcoholic steatohepatitis (NASH). He has received over $5 million in grant funding and authored over 50 publications. He has also designed and conducted numerous clinical studies on NASH through the NASH Clinical Research Network.
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
IU Data Visualization Class Final Project: Visualizing Missing Species Interactions
Team: Jim Nelson, Deepak Kher, Rama Raghava Reddy, Al Armstrong
INDIANA UNIVERSITY BLOOMINGTON
Visualizing Missing Species Interactions Data
Client Project – Information Visualization
Visualizing Missing Species Interactions Data
Final Project IU IVMOOC 2016
I. Project Title – Visualizing Missing Species Interactions Data
II. Visualization Title – Global Visualization of Missing Species Interactions
III. Team
Jim Nelson
Deepak Kher
Rama Raghava Reddy
Al Armstrong
IV. Visualization Goals & Importance of Project
Visualization Goals and Prototype
The aim of our project is to utilize the GloBI APIs to visualize understudied organisms and
locations with minimal interaction data within the GloBI data repository. Please see the
snapshots below for the expected visualization product from our project.
Importance of Project
The human population is continually growing and encroaching upon traditional wildlife habitat.
At the same time over fishing, pollution and global warming are threatening marine ecosystems.
If worldwide conservation efforts are to succeed it is imperative that we fully understand the
interactions of biological networks across the globe. We hope that our project could become an
important research tool to define knowledge gaps within the hierarchy of interactions among
species worldwide.
V. Related Work
Global Biotic Interactions (GloBI) is an open, interactive and integrated species interaction data
service (1). The goal of GloBI is to provide an infrastructure to catalog all known interactions
among existing species. GloBI provides a means for researchers to combine their biotic
datasets using automated tools that normalize, aggregate and integrate various datasets into
structured repositories (a Neo4j database) using standardized vocabularies and ontologies (2).
Currently, GloBI has cataloged nearly 1.4 million species interactions among 149,676 different
taxa gleaned from over 18,000 studies (3).
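Building a query against the GloBI web API (16) can be sketched as below. The endpoint and parameter names follow the API wiki, but they are assumptions on our part and should be verified there.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are assumptions based on the GloBI API wiki (16);
# check the wiki for the currently supported query parameters.
BASE = "https://api.globalbioticinteractions.org/interaction"

def interaction_url(source_taxon, interaction_type="interactsWith"):
    """Build a query URL for recorded interactions of a source taxon."""
    return BASE + "?" + urlencode({
        "sourceTaxon": source_taxon,
        "interactionType": interaction_type,
    })

# Example: interactions in which sea otters are the consumer.
print(interaction_url("Enhydra lutris", "eats"))
```

Fetching the resulting URL (e.g., with urllib.request) returns JSON describing the matching interactions.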
As shown in the figure below (4), GloBI is part of a network of related organizations, websites
and other data providers working to catalog and provide access to biological data. Other web
services that integrate directly with the GloBI data include the Encyclopedia of Life (EOL),
sponsored by the Smithsonian Institution, and the Gulf of Mexico Species Interactions
(GoMexSI) (5, 6). In addition, a number of published studies have utilized and cited the GloBI
datasets (7).
For the last two years GloBI has served as one of the IVMOOC client projects. In 2014, the
IVMOOC team created a food-web map by overlaying the GloBI interaction data with terrestrial
and marine ecoregion geospatial data (4, 8). To create the visualization the team utilized
several R packages along with Cytoscape and Adobe Illustrator. Last year the IVMOOC team
created the "GloBI Explorer", an interactive web app geared toward middle and high school
students (4, 9-10). Using the GloBI APIs, species thumbnail photos and simplified network
visualizations, the team created what should be a very effective educational resource for
getting students interested in biology and ecology.
VI. Data Statistics
Overview
The data to be used in this project was available in three formats in the GloBI GitHub repository
(2): Darwin Core (csv format) (11), Turtle (RDF format) (12) and Neo4j (graph database format)
(13). The data can also be accessed using software libraries (R and JavaScript) (14-15) or by
accessing the API directly (16). The datasets are recreated, normalized, integrated and
exported to the various data archives, such as a Neo4j graph database, Darwin Core archive and
RDF/Turtle archive, using Maven (17) as shown in the diagram below (2).
GloBI data normalization routine
We chose to utilize the data available as csv files in the Darwin Core Archive format, the
standard for biodiversity informatics data such as this (11). Six separate csv files were
downloaded and extracted from a single tarball file in the GloBI GitHub repository. Here is a
summary of the main variables in each file:
occurrence.csv
  occurrenceID = unique ID for each of the 1.4 million organism interactions
  taxonID = organism ID
  decimalLatitude
  decimalLongitude
association.csv
  occurrenceID = as above
  associationID = type of interaction (e.g., predator/prey, parasitic, pathogenic)
  additional fields that may or may not be needed
taxa.csv
  taxonID
  furtherInformationURL = web link to more information
  scientificName = Latin name
reference.csv
  table showing authors and study citations for datasets
measurementOrFact.csv
  table containing data related to different physical measurements obtained for each taxon
taxonCache.csv
  table containing phylogenetic hierarchy data (a.k.a. the tree of life), including scientific
  and common names
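The download-and-extract step described above can be sketched with Python's standard library. The archive URL below is a placeholder (the real location is in the GloBI GitHub repository (2)), and only the csv members are pulled out.

```python
import tarfile
import urllib.request

# Hypothetical archive location; the real tarball lives in the GloBI GitHub repository (2).
ARCHIVE_URL = "https://example.org/globi-dwca.tar.gz"

def fetch_and_extract(url, tarball_path, dest):
    """Download the Darwin Core tarball and extract its csv members into dest."""
    urllib.request.urlretrieve(url, tarball_path)
    with tarfile.open(tarball_path, "r:gz") as tar:
        # Only extract the csv files used in this project.
        members = [m for m in tar.getmembers() if m.name.endswith(".csv")]
        tar.extractall(dest, members=members)
    return [m.name for m in members]
```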
Data Extraction, Integration and Cleaning
We utilized a multi-pronged approach to extract, integrate and clean the datasets. Initially the
occurrence, association and taxa datasets were loaded into R, and using the R packages dplyr
(18) and tidyr (19) the occurrence.csv, association.csv and taxa.csv files were joined on the
occurrenceID and taxonID variables. Due to the non-uniform nature of many of the Darwin Core
variables, such as occurrenceID, which combined multiple ID formats from the original databases
(i.e., Encyclopedia of Life (EOL) (5), Global Biodiversity Information Facility (GBIF) (20) and
Integrated Digitized Biocollections (iDigBio) (21)), utilizing R for data cleaning became time
consuming and problematic. Ultimately these steps were performed using SQL. First the above csv
files were loaded into a SQL database. The data of interest was then extracted from the SQL
database using the following three custom Python scripts:
1. Taxon and Occurrence.py
For each taxonID in the occurrence file, all the occurrences were extracted and stored in a
JSON file. Here the key is taxonID and the value is a list of [occurrenceID, decimalLatitude,
decimalLongitude].
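A minimal sketch of this step, with an in-memory list of dicts standing in for the SQL query result (the row values are invented for illustration):

```python
import json
from collections import defaultdict

# Invented rows standing in for the SQL query over occurrence.csv.
rows = [
    {"taxonID": "tax1", "occurrenceID": "occ1", "decimalLatitude": 39.2, "decimalLongitude": -86.5},
    {"taxonID": "tax1", "occurrenceID": "occ2", "decimalLatitude": 40.0, "decimalLongitude": -87.0},
    {"taxonID": "tax2", "occurrenceID": "occ3", "decimalLatitude": -1.3, "decimalLongitude": 36.8},
]

# Key: taxonID; value: list of [occurrenceID, decimalLatitude, decimalLongitude].
by_taxon = defaultdict(list)
for r in rows:
    by_taxon[r["taxonID"]].append(
        [r["occurrenceID"], r["decimalLatitude"], r["decimalLongitude"]])

with open("taxon_occurrences.json", "w") as f:
    json.dump(by_taxon, f)
```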
2. Occurrence and Association.py
For each occurrenceID stored in the above JSON file, all relevant data from the association.csv
file was retrieved, such as associationID, targetOccurrenceID, association type and referenceID.
If there are association details for an occurrence they are stored; if the occurrenceID is
missing from association.csv, a null value is stored instead. These data were then exported to
another JSON file. Here the key is occurrenceID and the value is a list of [associationID,
targetOccurrenceID, referenceID, associationType].
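Under the same kind of toy data, this lookup might look as follows; None stands in for the null stored when an occurrence has no association record:

```python
import json

# Invented association rows keyed by occurrenceID.
associations = {
    "occ1": ["assoc1", "occ9", "ref1", "preysOn"],
}

occurrence_ids = ["occ1", "occ2"]

# Key: occurrenceID; value: [associationID, targetOccurrenceID, referenceID,
# associationType], or None when no association exists for that occurrence.
by_occurrence = {oid: associations.get(oid) for oid in occurrence_ids}

with open("occurrence_associations.json", "w") as f:
    json.dump(by_occurrence, f)
```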
3. Final.py
The two JSON files created by the above two scripts were merged, and the final data was stored
in final.csv.
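The merge can be sketched as a walk over the first mapping, pulling association details (or blanks) from the second; the column names and toy values here are assumptions:

```python
import csv

# Toy outputs of the two previous steps.
by_taxon = {"tax1": [["occ1", 39.2, -86.5], ["occ2", 40.0, -87.0]]}
by_occurrence = {"occ1": ["assoc1", "occ9", "ref1", "preysOn"], "occ2": None}

with open("final.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["taxonID", "occurrenceID", "decimalLatitude", "decimalLongitude",
                     "associationID", "targetOccurrenceID", "referenceID", "associationType"])
    for taxon_id, occurrences in by_taxon.items():
        for occ_id, lat, lon in occurrences:
            # Blank association fields mark occurrences with no recorded interaction.
            assoc = by_occurrence.get(occ_id) or ["", "", "", ""]
            writer.writerow([taxon_id, occ_id, lat, lon] + assoc)
```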
Dataset Description

Total number of occurrences (interactions):                           1,048,575
Number of unique occurrences:                                           700,683
Number of unique taxa (different organisms):                            108,345
Number of taxa accounting for the total number of unique occurrences:    55,741
Percentage of taxa without interaction data:                                51%
Number of occurrences representing taxa with multiple interactions:      46,914
VII. Data Analysis/Visualization
Workflow
1. Data was extracted for all TaxonIDs with no associationID values from the merged dataset
using Excel.
2. Data was sorted by TaxonIDs with the highest number of missing associations in each location.
3. Tableau was used to produce a geospatial visualization. Different colors in the map represent
different TaxonIDs, with the size of each circle indicating the number of records without
association data for that TaxonID in that location.
A) All values included
B) Cutoff of >50 applied
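Steps 1 and 2 above, performed here in Excel, could equally be scripted. The sketch below uses pandas on an invented toy frame: rows lacking an associationID are filtered out, then counted and ranked by TaxonID and location.

```python
import pandas as pd

# Toy merged dataset: a missing associationID marks an occurrence with no
# recorded interaction.
final = pd.DataFrame({
    "taxonID": ["tax1", "tax1", "tax2", "tax2", "tax2"],
    "decimalLatitude": [39.2, 39.2, -1.3, -1.3, -1.3],
    "decimalLongitude": [-86.5, -86.5, 36.8, 36.8, 36.8],
    "associationID": [None, "preysOn", None, None, None],
})

# Keep only rows with no association, then count per taxon and location.
missing = final[final["associationID"].isna()]
counts = (missing
          .groupby(["taxonID", "decimalLatitude", "decimalLongitude"])
          .size()
          .reset_index(name="missingCount")
          .sort_values("missingCount", ascending=False))
print(counts)
```

The resulting missingCount column is what the circle sizes in the Tableau map encode.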
VIII. Discussion of Key Insights
From the initial analysis of our data and the resulting visualizations it is clear that there is
an incredible lack of understanding of how the vast majority of the organisms on earth interact.
While it is impressive that over a million species interactions have been cataloged in total,
for 70% of these interactions the interaction is the only one recorded for at least one of the
two interacting taxa. Moreover, more than half of all the organisms in the GloBI database have
no recorded interactions at all.
From our initial visualization it appears that more research is being performed on the ecology
of the United States than on other parts of the world. However, the increased number of data
points mapping to the US could also result from greater participation in the GloBI project
among US ecology researchers.
IX. Interim Analysis/Design Issues
Several issues surfaced after discussing our initial visualizations with our client, Mr. Jorrit
Poelen. The main issue with our original visualization was that multiple common names and ID
numbers (depending on which original database the data came from) coexist in the datasets.
Moreover, these discrepancies are not uniform across the original csv files sharing common
variable names. Rama also discovered a related issue: many taxonID values in the taxa.csv file
are not used in occurrence.csv. Mr. Poelen was unaware of this issue and has listed it as a
pending issue to be addressed on the GloBI GitHub (22). For these reasons our original
visualization overestimated the number of taxa with missing interaction data. We are performing
additional data cleaning to resolve these issues prior to creating our final visualizations.
Our original design focused on geospatial visualization of taxa with missing or sparse
interaction data at the species level. Mr. Poelen also suggested that it would be very valuable
to expand our approach to include visualization of missing data at higher phylogenetic ranks,
such as the level of family, order or class. We are currently modifying our datasets and
investigating potential network visualization methods appropriate to these revised goals.
X. Challenges and Opportunities
We have faced many challenges thus far in this project, chief among them the complexity and
non-uniformity of the data, including many variables with combined string and numeric values
and multiple ID values associated with each of the >19 data sources. After several discussions
with the client, due to time constraints and for the sake of simplicity, we have revised our
original plan to focus on just the data from the largest data source: the Integrated Taxonomic
Information System (23).
Our original aim was to create a tool that biologists could utilize to better understand which
organisms and ecosystems are understudied throughout the world. Given the valuable input from
our discussions with our client, Mr. Poelen, we are confident that after several modifications
to our study design we will still be able to produce visualizations that convey this important
information in an informative and compelling manner.
References
1. Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic
Interactions: An open infrastructure to share and analyze species-interaction datasets.
Ecological Informatics. http://dx.doi.org/10.1016/j.ecoinf.2014.08.005
2. https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data
3. http://www.globalbioticinteractions.org/references.html
4. http://blog.globalbioticinteractions.org/
5. http://eol.org/
6. http://gomexsi.tamucc.edu/
7. http://www.globalbioticinteractions.org/about.html
8. Slyusarev, Sergey; Kontopoulos, Dimitrios-Georgios; Taysom, William; Guzman, Adrian;
Wadhwa, Bimlesh (2015): Global Biotic Interactions food web map.
https://figshare.com/articles/Global_Biotic_Interactions_food_web_map/1297762
9. http://danielabar.github.io/globi-proto/#/landing
10. https://figshare.com/articles/GloBI_Explorer_Interactive_Ecosystem_Explorer/1414253/1
11. https://en.wikipedia.org/wiki/Darwin_Core_Archive
12. https://www.w3.org/TeamSubmission/turtle/
13. http://neo4j.com/
14. https://cran.r-project.org/web/packages/rglobi/
15. https://www.npmjs.com/package/globi-data
16. https://github.com/jhpoelen/eol-globi-data/wiki/API
17. https://maven.apache.org/guides/introduction/introduction-to-repositories.html
18. https://cran.r-project.org/web/packages/dplyr/
19. https://cran.r-project.org/web/packages/tidyr/index.html
20. http://www.gbif.org/
21. https://www.idigbio.org/
22. https://github.com/jhpoelen/eol-globi-data/issues/220
23. http://www.itis.gov/