A talk given at iEvoBio 2011, a conference about Informatics for Phylogenetics, Biodiversity and Evolutionary Biology, held in Norman, Oklahoma, June 21-22, 2011
How the Encyclopedia of Life is wrangling organismal attribute data (Cyndy Parr)
The document discusses the Encyclopedia of Life's (EOL) efforts to aggregate and standardize organism attribute data from various sources. Some key points:
- EOL harvests data from over 240 content providers and hosts over 1.1 million species pages. It receives over 3.3 million annual visitors from 235 countries.
- EOL is developing a TraitBank to aggregate trait data from various datasets, totaling over 128,000 data points for over 20,000 taxa so far. It aims to make this data easily accessible and analyzable.
- Challenges include standardizing data from different sources and filling gaps, but aggregated trait data could help answer questions about topics like species interactions, tissue …
Australian Mangrove and Saltmarsh Resource - A model for ecosystem-based, onl... (Emma Clifton)
The document summarizes an online resource for information about Australian mangrove and saltmarsh ecosystems. The resource collates species, ecology, and human interaction data on a wiki platform in an effort to provide a freely available and updatable community information hub. It includes lists of over 1300 species as well as species profiles, habitat associations, identification keys, locality profiles, and a bibliography with over 6000 entries.
BioSharing - RDA Plenary 6 - Metadata Standards Catalog WG and BioSharing WG ... (Peter McQuilton)
An introduction to the metadata landscape in the life sciences, covering metadata standards, the databases that implement them, and the policies that endorse or recommend both standards and databases.
This document summarizes the goals and progress of the Open Tree of Life project, which aims to synthesize a complete draft tree of life using existing phylogenetic data. The project has collected phylogenetic data from over 7000 studies and stored it using graph databases. An open public interface allows users to browse, download and query the tree. The project is on track to release an initial draft tree in year 1 and refine it based on user feedback in year 2, while expanding collaborations and incentives for data contributors.
Presentation by Simon Mayo at the KikForum
Abstract:
As part of the CATE project we are developing keys in Lucid 3 to the genera of Araceae, and to the genera Anthurium (ca. 800 spp.), Arum and Philodendron. The key to Arum is already online. These keys will be incorporated into a web-based taxonomic revision of the Araceae family as the plant model group for the project. Anthurium presents a particular challenge as it is a very large and difficult genus, within which it is currently nearly impossible for non-specialists to determine plants to species. We hope the key will go some way to solving this problem.
The document provides an overview of the Encyclopedia of Life (EOL) including its goals, typical content, statistics, and role of fellows. EOL aims to create a page for every known species combining scientific information with public enthusiasm. It currently has over 1.9 million pages from 60 partners and seeks to grow further as new species are discovered. The EOL Fellows program supports early career scientists to engage with the project and help improve content.
SHARE is a collaborative initiative to improve access to and preservation of research outputs. It includes several interlocking components like a notification service, registry, and discovery tools. The notification service harvests information on new research releases from repositories and sources. It has provided over 40,000 reports on items like articles, datasets, and preprints. Challenges include varied source platforms, lack of standard metadata, and siloed systems. SHARE aims to address these through open standards, APIs, and new services to streamline the research process and outputs.
Evaluating the quality of open access content (Brian Bot)
Sage Bionetworks is a non-profit organization that supports open and collaborative biomedical research. It operates Synapse, a platform that enables large-scale collaboration by allowing researchers to share digital assets and communicate throughout the research process. Synapse has supported several large collaborations across disciplines, including the DREAM Challenges and TCGA Pan-Cancer Consortium, facilitating collaboration around common questions or data.
This document discusses two case studies linking biodiversity data from existing online resources:
1) Using the Encyclopedia of Life and Global Names Recognition and Discovery to capture species interactions from text and create a "digital ecosystem".
2) Linking phenotype data from Phenoscape and TraitBank to taxon location data from GBIF and environmental data from Map of Life to associate phenotypes with habitats. Both workflows are able to programmatically extract and integrate biodiversity data from multiple sources to generate new knowledge.
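Workflows like these come down to joining records from different services on a shared taxon name or identifier. A minimal sketch in Python, in which all record structures and values are invented for illustration (a real pipeline would pull from the TraitBank, GBIF, and Map of Life APIs):

```python
# Join trait records with occurrence records on a shared taxon name.
# All data here is illustrative, not drawn from any real service.

traits = [
    {"taxon": "Danio rerio", "trait": "fin ray count", "value": 7},
    {"taxon": "Astyanax mexicanus", "trait": "eye size", "value": 0.0},
]
occurrences = [
    {"taxon": "Danio rerio", "lat": 26.8, "lon": 87.3},
    {"taxon": "Astyanax mexicanus", "lat": 22.1, "lon": -98.9},
]

def link_records(traits, occurrences):
    """Associate each trait record with the localities of its taxon."""
    by_taxon = {}
    for occ in occurrences:
        by_taxon.setdefault(occ["taxon"], []).append((occ["lat"], occ["lon"]))
    return [{**t, "localities": by_taxon.get(t["taxon"], [])} for t in traits]

linked = link_records(traits, occurrences)
```

The join key is the weak point in practice: as several of the talks here note, taxon names must first be resolved to a common form before records from different sources can be linked this way.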
Data dialogue - Human Genomic Data DiscoveryFiona Nielsen
Presenting at The Data Dialogue. Time to Share: Navigating Boundaries & Benefits - Afternoon session: Sharing difficult data.
July 28, 2016 @ University of Cambridge
http://www.ses.ac.uk/event/data-dialogue-time-share-navigating-boundaries-benefits/
In this talk I present an overview of human genomic data sources around the world, their funding, access policies and the types of data they contain. I also discuss why data sharing is hard, including issues of data privacy and a research culture that does not incentivise sharing of data and results.
Presented by Fiona Nielsen, founder and CEO of Repositive
http://repositive.io
Nigel Robinson - ZooBank and Zoological Record: a partnership for success (ICZN)
The document discusses the partnership between Zoological Record (ZR) and ZooBank to provide taxonomic and nomenclatural information. ZR is the oldest continuing life science reference database, indexing over 72,000 items annually from 5000 journals. It captures metadata on new species names and has an archive of over 1.7 million articles. Through the Index to Organism Names project, ZR provides links from species names to relevant information. The partnership with ZooBank will provide stability, technology, and open access to taxonomic data and literature links for the scientific community.
Biodiversity informatics describes integrating biological research, computational science, and software engineering to deal with biotic data. Biodiversity data is used for taxonomy, biogeography, ecology, conservation and more. Data is collected, standardized, digitized and published using Darwin Core and made available through organizations like GBIF. Key challenges include dealing with synonyms and standardizing data across sources.
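Publishing with Darwin Core amounts to expressing each occurrence as a flat set of standard term/value pairs. A minimal sketch of writing one such record as CSV; the term names below are genuine Darwin Core terms, but the values are invented examples:

```python
import csv
import io

# A few standard Darwin Core terms; the record values are invented.
fieldnames = ["scientificName", "decimalLatitude", "decimalLongitude",
              "eventDate", "basisOfRecord"]
records = [
    {"scientificName": "Arum maculatum", "decimalLatitude": "51.5",
     "decimalLongitude": "-0.1", "eventDate": "2011-06-21",
     "basisOfRecord": "HumanObservation"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)
dwc_csv = buf.getvalue()
```

Because every publisher uses the same term names, an aggregator like GBIF can ingest such files from many sources without per-source mapping work.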
This document discusses the need to improve knowledge of transposable element (TE) content in genomes through standardized annotation methods, updating TE classification systems, and increasing collaboration between TE databases and researchers. It notes that over half of published genome sequences have incomplete TE annotation and classification. Different research groups use various TE classification approaches that do not fully reflect the diversity of mobile DNA. While several TE databases and tools exist, they generally operate independently with little interconnectivity or standardization. The document calls for increased discussion within the TE research community to develop solutions, such as re-annotating genomes, expanding database integration, and potentially forming an international society to establish standards for TE research.
The Road to TraitBank: What's Next for the Encyclopedia of Life (Cyndy Parr)
The document discusses plans to expand the Encyclopedia of Life (EOL) database to include a new "TraitBank" that will store trait data for millions of species. Currently, EOL contains basic information pages for over 1 million species but lacks details on species traits. The first step is adding limited trait data to EOL pages through a new funding initiative focused on marine species. The long term goal is to create a larger TraitBank database that can handle vast amounts of trait data, promote best practices, and enable crowd-sourcing contributions to facilitate research. By linking trait information to species on EOL, it will become a more powerful open resource for studying biodiversity.
Citizen Science: Association of American Medical Colleges conference (Darlene Cavalier)
This document discusses how SciStarter connects regular people to real science projects they can participate in as citizen scientists. It notes that millions enjoy science but thousands of scientists need volunteers, and SciStarter helps connect them. Examples are given of large citizen science projects in fields like astronomy, environmental monitoring, and health. The document promotes SciStarter's role in organizing these projects, matching volunteers with researchers, and helping to scale up citizen science.
The document announces an international conference on integrative biology to be held August 5-7, 2013 in Las Vegas, USA. The conference will bring together scientists from academia and industry to network and facilitate collaboration between computational biology, bioinformatics, and other fields to achieve common goals. It will include presentations, workshops, and exhibits to explore integrative approaches and recent advances in areas like bioinformatics, systems biology, and clinical data analysis.
In a speech for the Global Health Program at the Council on Foreign Relations in New York City, Calit2 director Larry Smarr addresses the issue of biological diversity and the importance of monitoring the microbiome.
Sleeping Beauty Transposon: Awakening a new approach to cancer treatment (Julie Kendrick)
13 million years in the making, Perry Hackett’s Sleeping Beauty transposon has far-reaching implications for identifying causes of disease, for use in gene therapy, and more. The Sleeping Beauty (SB) transposon, reconstructed from a fish transposon sequence that had been inactive for 13 million years, proved to be a game-changer in non-viral cancer gene therapy.
Using the Semantic Web to Support Ecoinformatics (ebiquity)
We describe our on-going work in using the semantic web in support of ecological informatics, and demonstrate a distributed platform for constructing end-to-end use cases. Specifically, we describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which allows scientists to semi-automatically construct distributed datasets relevant to the queries they want to ask. ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other semantic web resources.
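The queries Triple Shop constructs are ordinary SPARQL over the OWL data that ELVIS exposes. A sketch of building such a query for food-web links at a location; the vocabulary URIs (ex:eats, ex:foundAt) and the location URI are hypothetical placeholders, not the project's actual ontology:

```python
# Build a SPARQL query for predator-prey pairs co-occurring at a
# location. The ex: vocabulary is a hypothetical stand-in.

def food_web_query(location_uri):
    return f"""
    PREFIX ex: <http://example.org/ecology#>
    SELECT ?predator ?prey WHERE {{
        ?predator ex:eats ?prey .
        ?predator ex:foundAt <{location_uri}> .
        ?prey     ex:foundAt <{location_uri}> .
    }}
    """

q = food_web_query("http://example.org/locations/chesapeake-bay")
```

Sending the string to a SPARQL endpoint is then a single HTTP request, which is what makes the "semi-automatic distributed dataset" idea workable: the scientist supplies the location, the tooling supplies the query shape.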
A summary of the research work of Jean-Claude Bradley at Drexel University, September 2007: a few slides on the CombiUgi project and malaria, then some screenshots of the Open Notebook Science project UsefulChem.
Keynote Speaker 1 - Data Intensive Challenges in Biodiversity Conservation: a... (TERN Australia)
1) The document discusses using massive volumes of biodiversity data from sources like eBird to build species distribution models through data-driven techniques.
2) eBird gathers bird observation data from citizen scientists and uses this crowdsourced data along with review processes to build databases on species distributions.
3) Models like SpatioTemporal Exploratory Models (STEM) are used to predict species distributions across multiple scales by differentiating between local and global patterns and accounting for non-stationarity in species-habitat associations over space and time.
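The core STEM idea, fitting many local models on spatial subsets and combining their predictions, can be caricatured in a few lines. This toy version partitions observations into coarse latitude bands and fits a mean occurrence rate per band; real STEM ensembles use overlapping spatiotemporal blocks and far richer base models:

```python
# Toy spatial ensemble: one local model (a mean occurrence rate)
# per latitude band, drastically simplified relative to STEM.

def fit_local_models(observations, band_width=10.0):
    """observations: list of (lat, presence) pairs; presence is 0 or 1."""
    bands = {}
    for lat, presence in observations:
        bands.setdefault(int(lat // band_width), []).append(presence)
    return {band: sum(v) / len(v) for band, v in bands.items()}

def predict(models, lat, band_width=10.0, default=0.0):
    """Predict from the local model covering this latitude."""
    return models.get(int(lat // band_width), default)

obs = [(42.1, 1), (44.7, 1), (47.0, 0), (12.3, 0), (15.9, 0)]
models = fit_local_models(obs)
```

The payoff of the local-model structure is exactly the non-stationarity point above: the species-habitat relationship fit at 43°N is allowed to differ from the one fit at 13°N.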
This document discusses two upcoming events from OpenVirus and CEVOpen focusing on plant science literature. Part 1 will have a group of young Indian scientists discussing how to mine scholarly literature to find hidden science. Part 2 will be a fun game designed by Gita Yadav's research group to have non-science colleagues collect and analyze papers on plants, countries, and chemicals from a database of 200,000 open articles. The game aims to educate while being collaborative or competitive. The document also provides context on the organizer's previous work developing literature mining tools and collaborations on plant science research.
1) Molecular Cancer is an open access journal that aims to maximize the exchange of scientific information by making all of its content freely available.
2) Open access has several broad benefits including universal accessibility of articles online, copyright retention by authors, and permanent archiving of articles which can increase citations and dissemination.
3) Molecular Cancer accepts articles through a peer review process and publishes them online along with supporting materials, allowing for fast publication and wider dissemination of research.
WikiGenomes and Chlambase provide a semantic data model in Wikidata for integrating microbial genomics data. They model taxonomic and gene/protein relationships as Wikidata items and properties. A Python bot gathers data from sources and writes it to Wikidata. WikiGenomes serves as a centralized database for sequenced microbial genomes. It provides gene reports and supports community annotation directly in Wikidata. This engages domain experts and facilitates data access.
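The bot described follows a read-map-write pattern: fetch records from a source database, map each to Wikidata-style statements, and write them via the API. A sketch of the mapping step only; the property IDs here are placeholders, not real Wikidata properties, and a real bot would use a Wikidata client library and authenticate before writing:

```python
# Map a gene record to Wikidata-style (property, value) claims.
# Property IDs and the gene record are hypothetical placeholders.

def gene_to_statements(gene, props):
    """Return the claims to assert about one gene item."""
    return [
        (props["locus_tag"], gene["locus_tag"]),
        (props["found_in_taxon"], gene["taxon_qid"]),
    ]

props = {"locus_tag": "P_LOCUS_TAG", "found_in_taxon": "P_TAXON"}
gene = {"locus_tag": "CT_001", "taxon_qid": "Q_CHLAMYDIA"}
statements = gene_to_statements(gene, props)
```

Keeping the mapping separate from the write step is what lets community annotations and bot-sourced data coexist: both end up as ordinary statements on the same items.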
Philippe Rocca-Serra presented on MI-compliant ISA configurations at the MIBBI workshop. ISAcreator is a tool that uses configurations to enforce minimum reporting standards, such as MIAPE, MIGS, and MIAME, for different data types. The talk demonstrated how to build configurations based on checklists like MIENS, and use them in ISAcreator to standardize data collection and structured reports for various experiments, including next generation sequencing. Future work includes integrating more standards and making the configurations widely available.
The document discusses using the software R to access and analyze phylogenetic tree data from the online repository TreeBASE. It describes how R can be used to pull tree data from published studies in TreeBASE, repeat analyses such as diversification rate calculations, and compare results to newer methods or similar trees. The goal is to create tools that allow the research community to easily update meta-analyses when new tree data becomes available.
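The operations involved, parsing a published tree, computing a statistic, and comparing across trees, are language-agnostic even though the talk uses R. As a bare-bones illustration, here is a toy Python tip-counter for a Newick string (real pipelines would use proper tree libraries rather than a regex, and would fetch trees from TreeBASE rather than hard-code them):

```python
import re

def count_tips(newick):
    """Count leaf taxa in a simple Newick string.

    A leaf label directly follows '(' or ','; internal-node labels only
    ever follow ')'. Branch lengths (':0.3') end a label. This is a toy
    parser: no quoted labels or comments are supported.
    """
    return len(re.findall(r"[(,]\s*([^(),:;]+)", newick))

tree = "((Arum:0.3,Anthurium:0.3):0.2,Philodendron:0.5);"
```

Re-running a diversification-rate analysis then reduces to re-fetching the tree and recomputing such statistics, which is precisely the "easily updatable meta-analysis" goal the talk describes.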
The document outlines guidelines called GIATE for standardizing the reporting of information about therapy experiments across multiple levels from molecular to clinical. It describes the scope of GIATE in linking different experimental levels, resources, and standards. It also provides details on the initial implementation phase of adopting GIATE checklists and classification schemes, as well as the contributors, funding sources, and status of GIATE.
Susanna-Assunta Sansone presented at the International Conference on Systems Biology on standards to enable sharing of experimental data and metadata. Three types of standards are needed: minimum reporting checklists, controlled vocabularies and terminologies, and data exchange formats. Journals, biocurators, and funders are developing these standards to support comprehensible, reusable, and reproducible research. However, navigating the various standards can be challenging, and communication is needed between standards groups and other stakeholders.
The document discusses the importance of data standards and reporting standards for enabling data sharing and reproducible research in omics studies. It notes that three types of standards - minimum reporting requirements, semantics like nomenclatures and terminologies, and data formats - allow unambiguous representation and communication of experimental information. Many efforts are working on developing such standards, but the field is complex with a wide variety of standards being developed by different groups. Better coordination is needed between these efforts to help researchers navigate the "sea of standards" and determine which ones to use.
The document discusses the growing role of data sharing communities and standards in genomics research. It outlines key paradigm shifts from studying individual genomes to pangenomes and from culturable to unculturable organisms. It highlights the increasing data volumes from large sequencing projects and metagenomics. It emphasizes that community agreement on data stewardship and standards is needed to fully leverage these genomic data resources through the work of groups like the Genomic Standards Consortium.
The TNRS: a Taxonomic Name Resolution Service for Plants (Naim Matasci)
The document discusses the Taxonomic Name Resolution Service (TNRS) which is a tool that standardizes plant names. It resolves issues like misspellings, outdated names, and synonyms to map names to currently accepted taxonomic names. The TNRS is open source software available on GitHub and via web services and APIs documented on tnrs.iplantcollaborative.org to help unify plant names across data sources.
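At its core, this kind of name resolution is fuzzy matching against a list of accepted names plus a synonymy lookup. A toy sketch using only Python's standard library; the name lists are invented for illustration, and the real TNRS uses far more sophisticated name parsing, scoring, and authority data:

```python
import difflib

# Tiny illustrative lookup tables, not real TNRS data.
accepted = ["Arum maculatum", "Anthurium andraeanum", "Quercus alba"]
synonyms = {"Quercus alba var. repanda": "Quercus alba"}

def resolve(name, cutoff=0.8):
    """Map a misspelled or outdated name to an accepted name, or None."""
    if name in synonyms:                       # outdated name -> current one
        return synonyms[name]
    matches = difflib.get_close_matches(name, accepted, n=1, cutoff=cutoff)
    return matches[0] if matches else None     # misspelling -> closest match
```

The cutoff matters in practice: set too low, a misspelling can silently resolve to the wrong species, which is why services like the TNRS report a match score rather than just an answer.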
The document discusses the Encyclopedia of Life (EOL) project, which aims to create a web page for every known species containing key information about it. It outlines EOL's goals of aggregating biodiversity data from various sources and making it openly accessible online. The document also describes EOL's efforts to establish a taxonomic framework and infrastructure to facilitate collaborative curation of species pages.
Writing The Encyclopedia Of Life (not EoL.org) - Vince Smith
The document discusses the goal of comprehensively inventorying and documenting Earth's biodiversity through the Encyclopedia of Life project. It notes that while about 1.8 million species have been described, the total number is estimated to be between 10-30 million. The challenges discussed include integrating fragmented data from different sources and communities, addressing issues around incentives, politics, and licensing to encourage global collaboration on the project. Technical challenges involve developing standards, platforms, and web services to aggregate and semantically link biodiversity data at large scale.
Developing data services: a tale from two Oregon universities - Amanda Whitmire
While the generation or collection of large, complex research datasets is becoming easier and less expensive all the time, researchers often lack the knowledge and skills that are necessary to properly manage them. Having these skills is paramount in ensuring data quality, integrity, discoverability, integration, reproducibility, and reuse over time. Librarians have been preserving, managing and disseminating information for thousands of years. As scholarly research is increasingly carried out digitally, and products of research have expanded from primarily text-based manuscripts to include datasets, metadata, maps, software code etc., it is a natural expansion of scope for libraries to be involved in the stewardship of these materials as well. This kind of evolution requires that libraries bring in faculty with new skills and collaborate more intimately with researchers during the research data lifecycle, and this is exactly what is happening in academic libraries across the country. In this webinar, two researchers-turned-data-specialists, both based in academic libraries, will share their experiences and perspectives on the development of research data services at their respective institutions. Each will share their perspective on the important role that libraries can play in helping researchers manage, preserve, and share their data.
Encyclopedia of Life: Use cases for phenotypes - Cyndy Parr
EOL aggregates and curates scientific data from multiple sources to provide comprehensive summaries of taxa. It has grown from 2.8 million pages and 2 million data objects two years ago to 3.3 million pages and over 5 million data objects today. EOL is working to improve semantic search, link data to external resources, promote text mining and crowdsourcing, and provide analyzable data summaries to enable new types of research across the tree of life.
Beacon Network: A System for Global Genomic Data Sharing - Miro Cupak
The Beacon Network provides a system for global genomic data sharing by allowing users to query a network of genetic data sources to determine if a particular genetic variant or mutation exists in their databases. It began as a web service called Beacon that responds with "yes" or "no" to questions about genetic mutations. The Beacon Network expands this by distributing queries across multiple beacons and aggregating the results. It currently includes over 25 genomic organizations with access to over 2 million samples and 2 billion genetic variants, serving hundreds of thousands of queries from users around the world. The goal is to facilitate discovery of new links between genetic data and health conditions.
A talk given at the Semantic Reasoning workshop held at the National Museum of Natural History September 6, 2012. The audience included computer scientists and biological scientists interested in using EOL for their research.
The document provides an overview of the Research Data Alliance (RDA). Some key points:
- RDA builds social and technical bridges to enable open sharing of data across technologies, disciplines, and countries. It has over 3,700 members from 110 countries.
- RDA has 65+ working and interest groups that create standards, best practices, and other resources in 12-18 months to accelerate data sharing. This includes work on data citation principles, agriculture data, and more.
- RDA plays a role in connecting data initiatives at multiple scales from local to global. National groups support local participation in RDA to amplify effects for both national and international communities.
This document discusses challenges and opportunities for discovering and documenting biodiversity in the current information age. It argues that current taxonomic processes are too slow and that new approaches are needed to integrate distributed data sources and leverage community contributions. Specifically, it proposes:
1) Publishing new biodiversity data prior to formal documentation to accelerate discovery.
2) Developing automated workflows and online workspaces to integrate phylogenetic, distribution, and trait data.
3) Enabling community participation through open data sharing and collaborative annotation platforms.
This document discusses challenges and opportunities for discovering and documenting biodiversity in the current information age. It argues that current taxonomic processes are too slow and that new approaches are needed to integrate distributed data sources and leverage community sourcing. Specifically, it advocates for:
1) Publishing new biodiversity data prior to formal documentation to accelerate discovery.
2) Developing automated workflows and online workspaces to integrate phylogenetic, distribution, and trait data.
3) Enabling community participation in annotating and improving global biodiversity models and maps.
4) Changing incentives to value data sharing over individual "kudos" and prioritize the collective good of the scientific community.
Global patterns of insect diversity, distribution and evolutionary distinctness - Alison Specht
The presentation of the CESAB group ACTIAS at the 2016 French ecology conference, in the FRB-CESAB session "Using a treasury of knowledge to tackle complex ecological questions." Presenter: Carlos Lopez-Vaamonde
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK - Cyndy Parr
Talk presented at iEvoBio 2014 conference in Raleigh, North Carolina. Though there's a similar title and overlap with the talk I posted last week, there is new material here especially geared towards an informatics crowd savvy in the tools and technology.
Biodiversity Informatics: An Interdisciplinary Challenge - Bryan Heidorn
"Impacto de la Informática en el Conocimiento de la Biodiversidad: Actualidad y Futuro” at Universidad Nacional de Colombia on August 12, 2011. https://sites.google.com/site/simposioinformaticaicn/home
The document discusses the rise of big data in microbiology due to decreasing costs of DNA sequencing and computational resources. It describes how high-throughput sequencing is generating vast amounts of microbial genomic and metagenomic data. However, analyzing these large, complex datasets presents numerous technical and social challenges for microbiologists, including handling data volume, integrating diverse data types, accessing resources, and incentivizing data sharing. Overcoming these bottlenecks will be key to unlocking the scientific insights contained within the microbial "big data" tidal wave.
Frontiers of discovery with Encyclopedia of Life - Cyndy Parr
Presented at the National Museum of Natural History, Smithsonian Institution 18 June 2014
Describes, among other things, development of the TraitBank repository of species attributes, and the use of EOL and TraitBank in scientific research.
Scratchpads are virtual research environments that allow taxonomic and biodiversity data to be collected, curated, analyzed, published, and shared in a digital, open, and linked manner. They provide a seamless workflow for data by hosting websites for communities to enter and structure data using standardized modules. This facilitates dissemination of research through open access publishing of datasets, descriptions, keys, and more without reformatting. Major projects like e-Monocot demonstrate Scratchpads' ability to aggregate data from various sources into an integrated portal.
The document summarizes the goals and activities of the Biodiversity Heritage Library (BHL) project, which aims to digitize the published literature on biodiversity and make it openly accessible online. It discusses BHL's partnerships with other organizations like the Encyclopedia of Life to aggregate content. It also provides details on BHL's scanning operations and efforts to engage international partners to expand its global coverage of literature.
AB3ACBS 2016: EMBL Australia Bioinformatics Resource - Philippa Griffin
The EMBL Australian Bioinformatics Resource (EMBL-ABR) is a distributed national research infrastructure that provides bioinformatics support to life science researchers in Australia. It has a hub-and-nodes structure with the hub hosted at the Victorian Life Science Computation Initiative at the University of Melbourne and 10 nodes located across Australian institutions. EMBL-ABR aims to increase Australia's capacity for bioinformatics research and data science, provide training in bioinformatics, and enable participation in international collaborations.
The Open Tree of Life project aims to create a complete and freely accessible digital tree of life drawing from published phylogenetic studies and taxonomies. It has synthesized data from over 4,800 phylogenetic trees representing over 2,300 studies and 2.6 million taxonomic names. The synthesis process involves filtering and combining source trees from published studies into a single consensus tree using a graph database, while taxonomies provide taxon coverage. The resulting public tree of life is being refined with user feedback and new data. Future goals include improving the draft tree, developing synthesis methods, and adding user features like comparing trees and on-demand synthesis.
Similar to The emerging biodiversity data ecosystem
Webinar presentation by Cyndy Parr and Erin Antognoli hosted by Hunger Solutions Institute (HSI) and Presidents United to Solve Hunger (PUSH) at Auburn University on April 25, 2019.
The Ag Data Commons is a platform for aggregating, cataloging, and sharing agricultural data. It harvests metadata from various federal and university repositories to make data more discoverable without duplication of submission efforts. Currently it catalogs open datasets and links them to related literature. In the future, it aims to harvest more funding information and methodological details to better link datasets to associated research articles and grants. The goal is to organize agricultural data according to shared standards to make it fully machine-readable and reusable to support further research and decision-making.
Biodiversity informatics and the agricultural data landscape - Cyndy Parr
Introductory talk of a symposium on Agrobiodiversity informatics at the 2016 annual meeting of the Biodiversity Information Standards. Begins with an overview of the symposium and its speakers, and then launches into my talk.
Public access to research results at USDA - Cyndy Parr
An update on public access activities at the National Agricultural Library and next steps, presented 11 January 2017 at the Earth Science Information Partners (ESIP) meeting in Bethesda, Maryland.
Ag Data Commons: Agricultural research metadata and data - Cyndy Parr
The document proposes the Ag Data Commons as a solution to address challenges with agricultural research data by creating a central repository to host metadata and data according to federal directives for public access. It outlines the goals of the Ag Data Commons to support public access mandates through a sustainable platform for hosting and sharing agricultural research data and metadata in both human and machine-readable formats. The document also provides details on the workflow for submitting and publishing data on the Ag Data Commons to ensure standardized metadata and compliance with best practices.
Ag Data Commons: A new USDA catalog and repository for agricultural research ... - Cyndy Parr
The document summarizes the USDA's Ag Data Commons, a new catalog and repository for agricultural research data. It provides an overview of the National Agricultural Library's Knowledge Services Division, which manages the Ag Data Commons. Key points include that the Ag Data Commons provides data repository, curation, and management services; supports the open data initiative; and has grown from a prototype in 2015 to include almost 200 datasets from over 35 non-NAL users in its pilot phase in 2016. The goal is for the Ag Data Commons to become a centralized catalog and repository for open agricultural research data.
Preparing for data-intensive science across domains - Cyndy Parr
Presented at the American Institute for Biological Sciences council meeting, 8 December 2015. I focus on anecdotes from multiple domains about the kinds of skills and trajectories that empower scientists at multiple levels to become engaged in data-intensive science as data wranglers or tool-builders, even if they don't have lots of funding from NSF or NIH.
This document discusses the Ag Data Commons, a proposed solution for aggregating and providing access to open agricultural research data. It would support public access mandates by hosting USDA and other agricultural data. The Ag Data Commons would provide both human and machine access to metadata and data. It would integrate existing databases and repositories and add value by standardizing metadata, assigning DOIs, and linking to related data and literature. The document considers options for the technical platform, focusing on standards for metadata, controlled vocabularies, and trusted data repository requirements.
Ag Data Commons: Adding Value to open agricultural research data - Cyndy Parr
A talk presented on 30 September 2013 at the Biodiversity Information Standards (Taxonomic Databases Working Group TDWG) annual meeting in Nairobi, Kenya
This document provides an overview of the TDWG annual meeting in Nairobi, Kenya. It lists the meeting themes, registration details, membership numbers, executive committee members, and highlights of the program which includes symposia, talks, posters, workshops and interest groups. A history of past meeting locations is also included.
Practical interoperability across semantic stores of data for ecological, tax... - Cyndy Parr
Presented at the Biodiversity Information Standards (Taxonomic Databases Working Group) 2013 meeting in Florence, Italy on 31 October 2013. Essentially, an introduction to aspects of the back end of the new trait repository of Encyclopedia of Life.
Using and extending Darwin Core for structured attribute data - Cyndy Parr
Presented at the Biodiversity Information Standards (Taxonomic Databases Working Group) 2013 meeting in Florence, Italy on 29 October 2013. Essentially, an introduction to the new trait repository of Encyclopedia of Life.
Encyclopedia of Life: Applying Concepts from Amazon and LEGO to Biodiversity ... - Cyndy Parr
The document summarizes the Encyclopedia of Life (EOL) project, which aims to create a webpage for every known species. It discusses how EOL works by crowdsourcing content from over 240 providers and harvesting data from third party applications. EOL currently has pages for over 1.1 million species and sees 3 million unique visitors annually. The document outlines ongoing efforts to make EOL's large volume of species data more computable through linking data to external ontologies, promoting text mining and crowdsourcing of data, and developing infrastructure for standardized access and analysis of species interaction networks and trait information.
A talk presented January 19, 2013 in the Indo-US Joint Workshop on Biodiversity Informatics at the Ashoka Trust for Research in Ecology and the Environment in Bangalore, India.
A talk presented January 20, 2013 in the Indo-US Joint Workshop on Biodiversity Informatics at the Ashoka Trust for Research in Ecology and the Environment in Bangalore, India.
Leveraging an international infrastructure: Case studies from the Encyclopeda... - Cyndy Parr
This document summarizes a presentation about leveraging international infrastructure for species descriptions using the Encyclopedia of Life (EOL) as a case study. It describes EOL's efforts to aggregate and curate over 1 million taxon pages from 200 providers. It analyzes the types and languages of content, license restrictions, ratings of providers, and the roles of curators. It also discusses opportunities to improve standards, support quality control, and make content more multilingual and open. Case studies demonstrate how EOL coordinates with other databases to resolve errors. The presentation concludes that EOL has made progress but there is still room to expand coverage and engage more users, content providers, and funders.
EOL aggregates scientific data from various databases about all species globally to provide summaries for various audiences including enthusiasts, learners, citizen scientists, and scientists. It utilizes crowd-sourcing to improve data quality and provide computable data for research through features like collections, APIs, and challenges. Future enhancements aim to further enhance EOL's capabilities for scientific research.
1) EOL China collects and aggregates species data from various Chinese institutions into centralized databases for animals, plants, and microbes.
2) Data is collected in Bisby Core XML format and made available through a central EOL China portal and specialized websites.
3) The databases currently contain over 22,000 pages of information on Chinese species, with the goal of expanding coverage over the next few years.
The Western Ghats Portal (http://thewesternghats.in) is an open collaborative information system launched in January 2012 to disseminate biodiversity and conservation knowledge about the Western Ghats region. It was initiated to aggregate data from various partner institutions and provide an open access platform. Currently, it contains over 150 map layers, 600 species pages, and over 110,000 occurrence records. The portal is funded by CEPF until 2013 and aims to build a participative community and governance structure for long term sustainability. Key challenges include mobilizing additional data contributions and ensuring data quality at scale.
The emerging biodiversity data ecosystem
1. The emerging biodiversity data ecosystem. Cynthia Parr, Katja Schulz, Jennifer Hammock (Smithsonian Institution); Nathan Wilson, Patrick Leary (Marine Biological Laboratory); Richard Allen (Environmental Protection Agency)
2. Today’s story: What is EOL; core questions; network analysis; hot list development; page richness algorithm. Conclusion: improving the health and richness of our knowledge network advances understanding.
9. EOL is a content curation community. Content providers: databases, journals, LifeDesks, public contributions. Curating: aggregation, commenting, tagging. http://www.eol.org
10. Core questions: Where is our knowledge about biodiversity? Where are the gaps? What are the most effective ways to fill gaps given our limited resources?
11. Network analysis (with Anne Bowser, University of Maryland). EOL, GBIF, NCBI: EOL connects hubs.
14. Implications and next steps: Need more data; identify isolated projects and mechanisms for connecting them to the network; improve resilience and redundancy; distribute annotation and quality control; model data flow quantity and impact.
18. Developing the EOL hot list: Consultation with taxonomic experts; development of criteria; assembly of critical lists; establishing targets for rich taxon pages and lesser-known pages.
19. EOL’s hot lists. Hot List (70,000 taxa): conservation concern, invasives, model organisms, ecologically important, pests, charismatics, data availability. Red Hot List (2,800 taxa): most searched, top 100 invasives, crops (food), zoos &amp; aquaria, high traffic, higher taxa.
20. Taxon page richness algorithm. Richness = a(Breadth) + b(Depth) + c(Diversity), with weights a = 60%, b = 30%, c = 10%. Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status. Depth: # words per text object, # words total. Diversity: sources (partners). Scores range from 0 to 1; the threshold for a rich page is 0.4.
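The weighted sum on the slide above can be sketched in a few lines of Python. Only the weights (60/30/10%), the 0 to 1 score range, and the 0.4 threshold come from the slide; how each component is normalized (and the input caps mentioned in the editor’s notes) are assumptions in this illustrative sketch.

```python
def richness(breadth, depth, diversity,
             a=0.6, b=0.3, c=0.1, threshold=0.4):
    """Weighted sum of components, each assumed pre-normalized to 0..1.

    Returns (score, is_rich), where is_rich applies the 0.4 threshold.
    """
    score = a * breadth + b * depth + c * diversity
    return score, score >= threshold

# Example: a page strong on breadth, moderate on depth, few sources.
score, rich = richness(breadth=0.8, depth=0.5, diversity=0.2)
print(round(score, 2), rich)  # → 0.65 True
```

Because breadth carries most of the weight, a page with many media and text topics can be “rich” even with few contributing partners.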
21. Summary of EOL page richness. Overall: 640,000 pages have content; 2% are rich; 25% have only links to literature. Hot List: 28% of 75K are rich; average richness = 0.30. Red Hot List: 56% of 3K are rich; average richness = 0.43.
22. Strategies for improving richness: Crowd-sourcing, leveraging collections, communities, mobile apps; enabling platforms; enabling journals; data mining (BHL etc.). Version 2 coming in Fall 2011!
23. The page richness index helps fill gaps with existing knowledge, helps prioritize funding and training so that it has maximum impact on closing true gaps, and will be available via API. Computing and storing the richness index on EOL is a step towards storing and serving computable data.
24. Dynamic data summaries = new knowledge. Summarize data within a partner, then across partners. For example: compute an average value for one taxon (x specimens) and compare it to the range of values across all taxa (621,393 samples). Atlantic cod, Gadus morhua. Jen Hammock (EOL), Edward van den Berge (OBIS).
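The two-stage summary described above can be sketched as: average measurements within each partner for one taxon, then compare that taxon’s value to the range of values across all taxa. The partner names, taxa, and measurement values below are hypothetical.

```python
from statistics import mean

# partner -> taxon -> specimen measurements (hypothetical lengths, cm)
records = {
    "OBIS":     {"Gadus morhua": [55.0, 61.0], "Thunnus thynnus": [210.0]},
    "FishBase": {"Gadus morhua": [58.0],       "Thunnus thynnus": [230.0]},
}

def taxon_summary(taxon):
    """Average within each partner first, then average the partner means."""
    partner_means = [mean(t[taxon]) for t in records.values() if taxon in t]
    return mean(partner_means)

cod = taxon_summary("Gadus morhua")                    # one taxon's value
all_taxa = {t for partner in records.values() for t in partner}
values = [taxon_summary(t) for t in all_taxa]          # range across taxa
print(cod, min(values), max(values))  # → 58.0 58.0 220.0
```

Averaging within partners before averaging across them keeps a single data-rich partner from dominating the cross-partner summary.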
28. Thank you. http://www.eol.org. 160+ content partners, 2000 Flickr contributors, 1000s of Wikipedia contributors, 43,000 EOL members. Funding: John D. and Catherine T. MacArthur Foundation, Alfred P. Sloan Foundation, Cornerstone Institutions, private donors. See the demo and a Version 2 sneak peek in the Software Bazaar. Leadership: Erick Mata, Bob Corrigan, Mark Westneat, Marie Studer, Tom Garnett, Jim Edwards, David Patterson. Developers: Peter Mangiafico, Jeremy Rice, Dimitri Mozzherin, David Shorthouse, Lisa Whalley and others. Biologists: Tanya Dewey, Audrey Aronowsky, Leo Shapiro.
Editor's Notes
The conclusion is that there is value in treating all the biodiversity information systems as part of an interconnected ecosystem. We can study the connections, and we can assess the depth of information in the network. I’ll focus on EOL’s role in the system, but I hope to make observations that will be generally useful too.
Objects such as these are essentially chunks of text sorted by topic. They span biology from physiology to ecology to evolution. Each of these credits the source, and can receive comments or ratings, or can be trusted or untrusted by curators.
So, the approach of EOL is rather different from many other sites. EOL is a giant mashup that creates pages, which are then available for curators (mostly credentialed scientists) to assess and rate, or for anybody to provide comments or tags. 160+ partner databases; 700 curators, 1000s of contributors, 46,000 members; 2.8 million pages; 600 thousand pages with Creative Commons content; over 2 million data objects and more than 1 million pages with links to research literature. Traffic in the past year: 1.7 million unique users, 6.2 million page views.
This represents about 1600 projects, and 1700 instances of data flow or hyperlinks between them. The size of a vertex, or node, reflects its degree, or how many links the node has. We used the Clauset-Newman-Moore algorithm to determine which vertices grouped together, then gave each group a color code. Nodes with a degree of 15 or higher are labeled, and their edges are drawn thicker than the others. These are the hubs of this network, and they are reasonably well connected to each other. (Go through and expand the acronyms.)
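As a rough sketch, the workflow described here (greedy modularity clustering plus degree-based hub labeling) can be reproduced with networkx, whose `greedy_modularity_communities` implements the Clauset-Newman-Moore algorithm. The toy graph and the lower hub threshold are stand-ins for the real ~1600-project network, not the talk's actual data.

```python
# Sketch of the slide's analysis: cluster a network with the
# Clauset-Newman-Moore greedy modularity algorithm, color-code the
# groups, and flag high-degree nodes as hubs.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # placeholder for the biodiversity-project network
communities = greedy_modularity_communities(G)

# Assign each node a group id (in the talk, each group got a color code).
for group_id, members in enumerate(communities):
    for node in members:
        G.nodes[node]["group"] = group_id

# The talk labeled nodes with degree >= 15; this toy graph is smaller.
HUB_DEGREE = 10
hubs = [node for node, degree in G.degree() if degree >= HUB_DEGREE]
```

A plotting layer (e.g. matplotlib) would then scale node size by degree and thicken edges incident to `hubs`, as in the slide's figure.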
Daphne Fautin's Hexacorallians of the World
With this as a baseline, how connected and resilient is the network? Over time we want it to become more connected and resilient, both to enable discovery and to allow recovery in case of catastrophic problems. We can also use this baseline to develop effective mechanisms to annotate data and improve data quality. If the same data appear on different parts of the network and someone reports an error, the repair of that data needs to propagate effectively. What are the factors that influence data flow quantity and effectiveness…
Brighter green indicates a higher percentage of descendants with text; the size of each square is the number of descendants, square-root scaled.
Ecologically important – keystone species, indicator species
Inspired by community ecology and measures of species diversity, which of course were originally inspired by information theory, though we haven't used those measures directly. Instead we combined these factors in a way that let us assign weights based on how well each factor captures "a rich page." We sampled dozens of pages and had team members assess their gestalt "richness" by their own criteria. Then we compared those scores to the ones generated by the algorithm, and iteratively changed the weights until we arrived at a set that appeared to reflect human perception of "richness." Note that there is a penalty: unvetted material is only worth about 75% of vetted material. There are also maximums for many of the input values: having 200 images may not make a page much richer than having 25 images. We reserve the right to change this to ensure that the index is as useful as possible. Like Google PageRank, we want to ensure that nobody can game the system.
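The index described above, weighted factors with caps and an unvetted-material penalty, could be sketched as follows. The factor names, weights, and caps are illustrative assumptions, not EOL's actual values; only the 75% penalty and the capping behavior come from the talk.

```python
# Hypothetical sketch of a page-richness index: capped inputs, per-factor
# weights, and a ~75% penalty for unvetted material. All weights and caps
# below are made-up placeholders, not EOL's real parameters.

UNVETTED_PENALTY = 0.75  # unvetted material is worth ~75% of vetted

# factor -> (weight, cap); weights sum to 1.0 so scores fall in 0..1
FACTORS = {
    "images": (0.3, 25),        # >25 images adds little (the cap)
    "text_objects": (0.4, 20),
    "references": (0.2, 50),
    "maps": (0.1, 5),
}

def richness(counts, vetted_fraction=1.0):
    """Return a 0..1 richness score from capped, weighted factor counts."""
    score = 0.0
    for name, (weight, cap) in FACTORS.items():
        value = min(counts.get(name, 0), cap) / cap  # cap, then normalize
        score += weight * value
    # interpolate the penalty by the fraction of material that is vetted
    penalty = UNVETTED_PENALTY + (1 - UNVETTED_PENALTY) * vetted_fraction
    return score * penalty
```

Note how the cap reproduces the behavior described in the notes: a page with 200 images scores the same on that factor as a page with 25.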
Also note the implication that a "rich page" is a "high quality page" – not necessarily true, but often it is. As EOL moves forward with version 2, we'll be gathering other inputs that can tell us whether a page is successful – ratings of its objects, for example.
Here's what we are already doing for the OBIS specimens, which have rich environmental data associated with them. We could add similar values from other partners, for example from GenBank, where some sequenced samples are collected from known environments, or from ecological studies that aren't part of the specimen-based system. A scientist could subscribe to this value and get alerts if new values come in that are outside this range. We could also set up a model for this taxon and its relatives, predicting expected values; then, if new values aggregated from any of EOL's partners violate the model, the scientist who published the model gets a notification. It could be that there's a flaw in the data integration, or some violation of assumptions about the measurement workflow. Or it could be that there's something we truly didn't understand before. This would truly leverage the scientific output of many researchers: better use of resources, more rapid advances in understanding of biological systems.
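The subscribe-and-alert idea above can be sketched minimally: summarize a taxon's observed values, then flag newly aggregated values that fall outside the expected range. The function names and the `notify` callback are illustrative assumptions, not an EOL API.

```python
# Minimal sketch of the alerting model: a subscriber registers an expected
# range for a taxon's attribute; newly aggregated values outside that range
# trigger a notification. All names here are hypothetical.
from statistics import mean

def summarize(values):
    """Per-taxon summary: average plus the observed range."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}

def check_new_value(value, expected, notify):
    """Return True if the value fits the expected range; otherwise alert."""
    if not (expected["min"] <= value <= expected["max"]):
        notify(f"value {value} outside expected range "
               f"[{expected['min']}, {expected['max']}]")
        return False
    return True
```

In practice the "expected range" would come from a published model for the taxon and its relatives rather than raw min/max, but the subscription mechanics are the same.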
This is analogous to the study of ecosystems, where we seek to build an understanding of entire systems with many kinds of inputs, both biotic and abiotic.