This document discusses efforts to digitize records of Hemiptera insect specimens and their host plants. Over 1 million insect specimens from 124 families have been digitized so far, including records of what plants they were collected on. The author analyzes the data to determine the probability that a hemipteran insect species was collected on a federally endangered or threatened plant in the US. The analysis finds that 19 species have over a 10% probability of being collected on an endangered host plant, and 9 species are only known from endangered hosts. Future work is suggested to expand this type of analysis to more insect groups and host plant data.
1. Recreating biomes one label at a time
Katja C. Seltmann
American Museum of Natural History
enicospilus@gmail.com
2. Human-mediated Disturbance
We are experiencing a great loss in biodiversity that is difficult to
calculate. Many efforts exist to sample a snapshot of
present biodiversity as the world is rapidly changing
from increased human activity.
3. Historical Ecology
What we grow up with today ends up as the baseline
for our viewpoint of “nature” in the future.
8. IUCN red list status
0.52% species of insects evaluated
9. Heteroptera have declining plant hosts
IUCN Red List:
Worldwide total plants: 345
81 plant species
322 Hemiptera species
USDA Plants List:
(Federally listed Endangered or Threatened)
North American total plants: 752
31 plant species
127 Hemiptera species
11. Conclusions
[Figure: number of insect species (y-axis, 0 to 10) plotted against p(x|y) (x-axis, 0.0 to 1.0), with separate series for red-listed and not red-listed hosts]
In this dataset (31 hemipteran
species collected from USDA red-
listed plants and all of their other
known hosts)
there is a higher probability that the
insect was collected on a
red-listed host than on a non-red-listed host.
19 species have a > 10%
probability that they were
collected on a federally
Endangered or Threatened
Plant.
9 species are only known to be
collected on a red-listed species.
12. Future directions
• Continue to explore new methods for examining
this data (Data Science: Ontology & Machine
Learning).
– Explore data bias of collectors by adding Collector into
the equation of p(x|y).
– Include a third trophic level (parasitoids) into the data
analysis.
– Expand to world plant host list and other insect
records outside of Hemiptera.
– Include known phylogenies of insect and plant.
13. Acknowledgements
•TTD-TCN project PIs, digitizers and
managers
•Randall T. Schuh
•National Science Foundation
•iDigBio and www.datacarpentry.org
•Museum collections and curators
worldwide
Editor's Notes
At the present time we are experiencing a great loss in biodiversity. However, the amount of loss is difficult to calculate. Many efforts exist to sample a snapshot of that diversity now (inventory studies, hotspot conservation and DNA tissue banks) as the world is rapidly changing from increased human activity.
There is a budding field of ecology known as “historical ecology”, where we recognize that our conception of nature is changing with every new generation. We tend to accept what is natural around us as we grow up as a generational baseline for conservation. Examples include how New Yorkers consider Central Park a natural area, the English vision of the naturalness of the English countryside, and the ever-changing view regarding the ideal composition of our national forests. If we only look at what is presently found in a given environment, we miss the historical ecology of an area and its future potential for conservation and restoration.
Notes: (what do you eat for breakfast).
Our conception of nature / natural world changes with each generation.
We are typically aware of modifications in our natural environment on a very gross scale. For example, the American Chestnut, Castanea dentata, is a large, monoecious deciduous tree of the beech family native to eastern North America. Before the species was devastated by the chestnut blight, a fungal disease, it was one of the most important forest trees throughout its range. Along with the devastation of the majestic chestnut, all of its animal associates also had to adjust to the resulting modification of the forests. The question is how we can look into the past to reveal how some of these changes occurred.
In this analysis I focused on one group of insects. The Hemiptera (e.g., cicadas, aphids, planthoppers, assassin bugs, milkweed bugs, leafhoppers, treehoppers, plant bugs, stink bugs, and many others) is a highly diverse order of insects, with an estimated 100,000 species worldwide, and around 11,150 species documented for North America. About 85% of Hemiptera feed on plants by directly piercing tissues, and many show a high degree of plant host specificity. The analysis took advantage of the data collection efforts of two major NSF-funded projects focused on Hemiptera, the Plant Bug PBI and the Tri-Trophic TCN, that together aggregated one of the most comprehensive specimen datasets for any order of insects. This dataset, termed the "Hemipteran Dataset", represents over 1.5 million specimen data records digitized from 190 different natural history collections, 141 hemipteran families, 310 host plant families, 413,400 recorded species interactions, and over 1200 habitat descriptions, and contains historical specimens collected from 1890 to the present. These data, obtained from natural history collection specimen labels, contain information about the specimen at the time it was collected, including specimen sex, life history stage, phenology, collection location, habitat, and host plant. The Hemipteran Dataset is almost unique in its comprehensive coverage for one group of insects in this regard.
Hemiptera consist of 50-80 thousand species worldwide, many of which use a proboscis to feed on plants. The aphids, in the family Aphididae, and the plant bugs, in the family Miridae, are two diverse families within the order, both of which commonly feed on plant hosts. As a favored host species declines, host switching may occur, as some hemipterans have diverse feeding habits. However, it is well known from the literature and from discussions with domain experts that some groups of insects are not able to host switch so readily.
The challenge is to use literature and specimen data records to calculate the probability that a host switching event could occur, or the probability that a given insect (x) is found on a given plant (y). A higher probability is an indication that the insect may not have the ability to host switch.
Dealing with messy data requires leaning toward probability calculations, and the data we are collecting from efforts to digitize natural history collections are beautifully messy.
Natural history collections are our window into the ecology of the past, but we face a grand challenge: the data are non-standard, inconsistent, not in a digital format, and difficult to summarize. Error is introduced into the data by collecting and curation methods, misidentification, or unsubstantiated observations in the literature. When dealing with rare events, such as collecting on endangered plants, parsing out these errors can be a challenge. The data are known to be difficult, but methods do exist, specifically in the computer science field of “machine learning”, to help us deal with heavily biased data.
The hemipteran dataset contains over a million records, all digitized as part of one of the first Thematic Collection Network projects, the Plants, Herbivores, and Parasitoids: A Model System for the study of Tri-Trophic Associations, and a Planetary Biodiversity Inventory project for plant bugs. The data include 124 total hemipteran families, ranging in date from 1811 to the present.
The reality is that data deficiency poses a great challenge for calculating biodiversity loss in non-keystone species. Insects are some of the most numerous organisms on the planet, with unequaled biodiversity, yet we have no idea how our changing world is affecting them. These numbers, retrieved from the International Union for Conservation of Nature (IUCN) website, indicate the number of evaluated species for each animal class. Mammals, birds and reptiles have been comparatively easier to evaluate, likely due to large body size. The most interesting number in this calculation, however, is the “data deficient” column. It seems that the majority of the 950,000 described species would predominantly fall in this category; however, it is unlikely that they will ever be evaluated.
Many plants, however, have been evaluated by the IUCN as well as the USDA.
If we compare the entire million-plus hemipteran data records to these red lists we see:
For IUCN
Worldwide total plants: 345
81 plant species
322 Hemiptera species
For USDA
North American total plants: 752
31 plant species
127 Hemiptera species
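The comparison above can be sketched as a simple filter over insect and host plant records. This is a minimal sketch and not the author's actual workflow: the record layout, the species names, and the function `summarize_red_list_overlap` are hypothetical, and it assumes one reading of the counts, namely the size of the list, the listed plant species that appear in the records, and the insect species recorded on them.

```python
# Hypothetical records: one (insect_species, host_plant_species) pair per specimen.
records = [
    ("Insect A", "Plant 1"),
    ("Insect A", "Plant 2"),
    ("Insect B", "Plant 2"),
    ("Insect C", "Plant 3"),
]

# Hypothetical red list of plant species (for example IUCN or USDA listed plants).
red_list = {"Plant 2", "Plant 4"}


def summarize_red_list_overlap(records, red_list):
    """Count listed plants that appear in the records and the insects on them."""
    matched = [(insect, plant) for insect, plant in records if plant in red_list]
    plants_with_records = {plant for _, plant in matched}
    insects_on_listed_plants = {insect for insect, _ in matched}
    return {
        "red_list_size": len(red_list),
        "plants_with_records": len(plants_with_records),
        "insect_species": len(insects_on_listed_plants),
    }


summary = summarize_red_list_overlap(records, red_list)
print(summary)
```

In practice the plant names on specimen labels would need to be reconciled with the names used by each list before a match like this is meaningful.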
The network diagram in the background represents the hemipteran data network for all associated USDA red-listed plants, plus any other plant recorded as a host for those insects. The blue balls are insect nodes (species) and the green ones are plant nodes (species). Every line in between is a connection between an insect and a plant. If we zoom in…
We can begin to see some trends.
1) Some plants are linked to many insects.
2) Some insects are linked to many plants.
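The two trends can be read off an insect and plant network as node degrees. The following is a minimal sketch with made-up species names; `build_network` is a hypothetical helper and not part of the original analysis.

```python
from collections import defaultdict


def build_network(records):
    """Build a two-sided (insect, plant) network from (insect, plant) pairs."""
    plants_of = defaultdict(set)   # insect -> set of recorded host plants
    insects_of = defaultdict(set)  # plant -> set of insects recorded on it
    for insect, plant in records:
        plants_of[insect].add(plant)
        insects_of[plant].add(insect)
    return plants_of, insects_of


# Hypothetical records; repeated pairs count as a single link in the network.
records = [
    ("Insect A", "Plant 1"),
    ("Insect A", "Plant 2"),
    ("Insect A", "Plant 3"),
    ("Insect B", "Plant 2"),
    ("Insect C", "Plant 2"),
    ("Insect C", "Plant 2"),
]

plants_of, insects_of = build_network(records)

# Degree of each node: how many partners it is linked to.
insect_degree = {insect: len(plants) for insect, plants in plants_of.items()}
plant_degree = {plant: len(insects) for plant, insects in insects_of.items()}

print(insect_degree)
print(plant_degree)
```

Here a high plant degree corresponds to a plant linked to many insects, and a high insect degree to an insect linked to many plants.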
We want to understand the relative importance of a red-listed plant by the number of insects for which it is a host, and how opportunistic an insect is by how many plants are recorded hosts for that insect. We need to take into account how many times the insect was collected on any plant, as well as the number of collecting events on red-listed plants.
We then can calculate the probability, given our subset of data (red-listed USDA plants, and all insects observed to associate with those plants). Only multiple independent collecting events (i.e. singletons removed) were used in analysis.
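One way to sketch this calculation is as a relative frequency: for each insect, the share of its collecting events that were on a red-listed plant. This is an illustration under stated assumptions, not the author's method. The function `red_list_probability`, the `min_events` parameter, and the example events are hypothetical, and "singletons removed" is read here as dropping any insect and plant pair seen in only one collecting event.

```python
from collections import Counter


def red_list_probability(events, red_list, min_events=2):
    """Estimate, per insect, the share of collecting events on red-listed plants.

    events: list of (insect, plant) pairs, one per independent collecting event.
    Pairs seen fewer than min_events times are dropped (assumed singleton rule).
    """
    pair_counts = Counter(events)
    total = Counter()
    on_red_list = Counter()
    for (insect, plant), n in pair_counts.items():
        if n < min_events:
            continue  # singleton pair, excluded from the estimate
        total[insect] += n
        if plant in red_list:
            on_red_list[insect] += n
    return {insect: on_red_list[insect] / total[insect] for insect in total}


# Hypothetical collecting events.
events = (
    [("Insect A", "Plant 1")] * 3
    + [("Insect A", "Plant 2")]      # singleton, removed
    + [("Insect B", "Plant 2")] * 2
    + [("Insect C", "Plant 1")] * 2
    + [("Insect C", "Plant 2")] * 2
)
red_list = {"Plant 2"}

probs = red_list_probability(events, red_list)
above_10_percent = sorted(i for i, p in probs.items() if p > 0.10)
only_red_listed = sorted(i for i, p in probs.items() if p == 1.0)
print(probs, above_10_percent, only_red_listed)
```

The two derived lists mirror the kinds of summaries reported in the conclusions: insects with more than a 10% estimated probability, and insects known only from red-listed hosts.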
For the 127 species of Hemiptera in this dataset (31 insects collected from USDA red-listed plants and all of their other known hosts), there is a higher probability that the insect was collected on a red-listed host than on a non-red-listed host.
19 species have a > 10% probability that they were collected on a federally Endangered or Threatened plant.
9 species are only known to be collected on a red-listed species.
Analysis methods beyond those that are conventionally applied in biodiversity research will be needed in order to extract meaningful information from these natural history collection data. Fortunately, research in Data Science has rapidly matured and been applied in other areas that involve large quantities of information (social media and genomics). Computer Science methods that include machine learning and ontology have yielded a variety of new techniques for pattern recognition, data quality assessment, and trend analysis. In very general terms, data science refers to the extraction of knowledge from data. Ontology and machine learning are methods and areas of research in computer science that are significantly associated with data science. Machine learning refers to the development of algorithms to facilitate pattern recognition, classification, and prediction, based on models derived from existing data. Ontology, or ontological reasoning, can be defined as the process of inferring information about the organization of descriptive terms utilizing structured and explicitly defined dictionaries.
In the future, I plan to continue to explore the human impacts on biodiversity, by taking bold steps toward the cross-pollination of new methods between biology, informatics, and computer science.
Many people made this effort possible, including museum curators, collection managers, and collectors worldwide.