This research aimed to develop a tool to group bioassays from PubChem based on experimental parameters extracted from narratives using natural language processing (NLP). The researchers used Latent Semantic Indexing (LSI) to identify topics in over 2000 bioassay narratives drawn from PubMed abstracts. LSI was able to group assays without supervision but was sensitive to the number of tokens and concepts used, with the primary topic shifting between species and chemical compounds. While the results are encouraging, additional studies are needed to better control LSI's effectiveness for chemical modeling applications.
Abstract
Large amounts of bioassay data are being collected in public
repositories such as PubChem; however, there is a lack of standardized
methods for entering and storing the data and, more importantly, for
automatically grouping bioassays by topic. The main objective of this
research is to create a tool that 1) lets a chemical domain expert
manually review and approve narrative tokens as they are generated, and
2) automatically groups bioassays based on the approved token sets.
Grouping bioassays in a semi-supervised manner will allow chemists to
extract more relevant compound-assay data sets for Quantitative Structure
Toxicity Relationship (QSTR) modeling in the future.
Data
PubMed is a publicly accessible document database developed and
maintained by the National Center for Biotechnology Information (NCBI).
PubMed hosts information related to biomedicine and health, life sciences,
behavioral sciences, chemical sciences, and bioengineering, and is the
source of data for our project. The PubMed data used for this project
consisted of publication titles and abstracts. Each abstract described the
biological assays and the experimental parameters that generated the
results. These experimental parameters included species, small molecules,
and endpoints. For this experiment we used 2162 randomly selected bioassay
narratives.
Methods
Tokenization – Tokenization is the process by which a stream of text is
broken into word elements called tokens. Tokenization was used to
parse bioassay narratives into a list of meaningful experimental
parameters including, but not limited to, small molecules, compounds, and
endpoints. Regular expressions were used to match tokens or
separators within the narratives. A specific instance, which keeps
bracketed and parenthesized expressions together as single tokens, is:
RegexpTokenizer('\S*\[[^\]]*\]\S*|\S*\([^)]*\)\S*|\S+')
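A minimal sketch of how such a pattern splits a narrative is given here. It assumes the printed pattern lost its backslash escapes and should read as \S, \[, \], \( and \); that reading is a reconstruction, not confirmed by the poster. The sketch uses Python's standard re module, whose findall plays the role of NLTK's RegexpTokenizer when the pattern describes tokens rather than gaps. The sample narrative is invented for illustration.

```python
import re

# Assumed reconstruction of the poster's pattern (backslashes restored).
# Alternative 1 keeps "[...]" groups attached to adjacent text,
# alternative 2 does the same for "(...)", and alternative 3 is any
# other run of non-whitespace characters.
BIOASSAY_TOKEN_PATTERN = re.compile(r"\S*\[[^\]]*\]\S*|\S*\([^)]*\)\S*|\S+")


def tokenize(narrative):
    """Split a bioassay narrative into tokens, left to right."""
    return BIOASSAY_TOKEN_PATTERN.findall(narrative)


# Invented example narrative, not taken from the project's data.
tokens = tokenize(
    "Inhibition of [3H]dopamine uptake in rat (Sprague Dawley) striatum"
)
print(tokens)
```

With this reading, a radiolabel such as [3H]dopamine and a strain name such as (Sprague Dawley) each survive as one token instead of being split on brackets or inner whitespace.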
Dictionary Referencing – A domain-specific dictionary was built using
ChEMBL, the BioAssay Ontology (BAO), and MeSH. BAO includes concepts
and relationships relevant to biological assays, ChEMBL is a chemical
database of bioactive molecules with drug-like properties, and MeSH is a
controlled-vocabulary thesaurus used to index biomedical literature. These
controlled vocabularies were used both to assist the chemist in
reviewing tokens and to permit automated validation of token content.
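The validation step could look roughly like the sketch below. The poster does not describe how the dictionary is stored or matched, so the function names, the case-insensitive exact match, and the tiny term lists are all assumptions; the terms are placeholders, not extracts from ChEMBL, BAO, or MeSH.

```python
def build_lookup(vocabularies):
    """Map each lower-cased term to the set of vocabularies containing it."""
    lookup = {}
    for source, terms in vocabularies.items():
        for term in terms:
            lookup.setdefault(term.lower(), set()).add(source)
    return lookup


def validate_tokens(tokens, lookup):
    """Split tokens into auto-approved ones and ones left for the chemist."""
    approved, needs_review = [], []
    for token in tokens:
        key = token.strip("()[].,;").lower()  # ignore surrounding punctuation
        sources = lookup.get(key)
        if sources:
            approved.append((token, sorted(sources)))
        else:
            needs_review.append(token)
    return approved, needs_review


# Placeholder term lists (hypothetical, for illustration only).
lookup = build_lookup({
    "ChEMBL": ["aspirin", "dopamine"],
    "BAO": ["IC50", "cell viability"],
    "MeSH": ["rats", "liver", "dopamine"],
})
approved, needs_review = validate_tokens(["Dopamine", "IC50", "foobarase"], lookup)
```

Tokens found in at least one vocabulary are approved along with their sources; anything else is queued for manual review, which matches the human-assisted workflow described above.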
Latent Semantic Indexing (LSI) – LSI projects documents from a
high-dimensional term space into a lower-dimensional concept space using
singular-value decomposition (SVD). Truncating the SVD reduces the
dimensionality of the document space and places semantically related
documents close together. LSI can therefore identify the concepts
contained in the text even when different words are used to express them
in similar contexts.
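The projection can be illustrated with a small dependency-free sketch. It is not the project's implementation: the five token lists are invented, raw term counts stand in for whatever weighting was actually used, and the truncated SVD is obtained by power iteration on the document Gram matrix (whose eigenvectors are the right singular vectors) rather than by a library SVD routine.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[float(c[t]) for c in counts] for t in vocab]


def lsi_document_vectors(matrix, k, iters=300):
    """Return one k-dimensional concept-space vector per document.

    For A = U S V^T, the Gram matrix A^T A equals V S^2 V^T, so its top-k
    eigenpairs give the document coordinates S_k V_k^T.
    """
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n  # deterministic starting vector
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left to extract
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * gram[i][j] * v[j]
                         for i in range(n) for j in range(n))
        sigma = math.sqrt(max(eigenvalue, 0.0))
        for j in range(n):
            coords[j].append(sigma * v[j])
        for i in range(n):  # deflate so the next pass finds the next concept
            for j in range(n):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return coords


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Invented toy narratives, already tokenized.
docs = [
    ["rat", "liver", "toxicity"],
    ["rat", "liver", "assay"],
    ["mouse", "liver", "toxicity"],
    ["kinase", "inhibitor", "compound"],
    ["kinase", "inhibitor", "potency"],
]
vocab, matrix = term_document_matrix(docs)
vectors = lsi_document_vectors(matrix, k=2)
raw_columns = [[row[j] for row in matrix] for j in range(len(docs))]
```

In this toy corpus the second and third narratives share only the term "liver", so their cosine similarity in raw term space is 1/3, yet with two concepts they fall on the same concept axis, while the kinase narratives fall on the other. A production system would more likely call an established SVD or LSI library.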
Results
LSI captured high level topics and contextual summarization of bioassays.
This was accomplished with no supervised class training and substantiated by
reviewing the matching narrative belonging to each extracted topic. However,
LSI behaved differently under different parameter sets. The 10000 token/50
concept model focused on species as a primary topic whereas the 30000
token/400 concept model focused on chemical compounds as the primary
topic. This sensitivity to training parameters, and thus different concept
mappings, was consistently observed across nine different LSI parameter sets.
Conclusion
A software system was developed to demonstrate NLP's potential with
bioassay datasets. The test system created allowed a chemist to perform
human-assisted chemical terminology tokenization followed by
automated concept mapping. Although our NLP results were
encouraging, suggesting the feasibility of automated parsing and
stratification of complex bioassay descriptions, additional empirical
studies will be required to evaluate LSI parameter sensitivity on topic
extraction, and thus our ability to control this technique's effectiveness
in a chemical modeling environment.
Acknowledgments
This material is based upon work supported by IOMICS Corporation.
Special thanks to Joe Gormley for Technical Project Management. Special
thanks to Tom Zisk for Software Engineering support.
References
• BioAssay Ontology: Mader C, Datar N, Abeyruwan S, Koleti A, Venkatapuram S, Chung C, Puram D, Vempati U,
Sakurai K, Przydzial M, Lemmon V, Visser U, Schurer S. http://baosearch.ccs.miami.edu/baosearch/
• Latent Semantic Indexing: https://en.wikipedia.org/wiki/Latent_semantic_indexing
• NLTK: http://www.nltk.org/
• PubChem: Wang Y, Suzek T, Zhang J, Wang J, He S, Cheng T, Shoemaker BA, Gindulyte A, Bryant SH. PubChem
BioAssay: 2014 update. Nucleic Acids Res. 2014 Jan 1;42(1):D1075-82. Epub 2013 Nov 5 [PubMed PMID:
24198245] https://pubchem.ncbi.nlm.nih.gov
• PubMed Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2005-. PubMed Help.
[Updated 2015 Aug 7]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK3827/
Bioassay Concept Mapping: Limitation of Current NLP Technologies
Mohammed Ayub, Bir Kafle, Hai Lu, Suman Lama, Nikita Mutta
Advisor: Professor Fatemeh Emdad
Future Direction
Future improvements may include:
• Automate exploration of LSI experimental parameters to determine
their effect on, and the sensitivity of, concept extraction.
• Deeper integration of BAO and ChEMBL to support automated concept
categorization and validation.
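One way the first item might be approached is sketched below: enumerate the parameter settings, then score how much the leading concept's top terms change between settings. Everything here is hypothetical. The poster names only the 10000 token/50 concept and 30000 token/400 concept models, so the intermediate grid values are assumptions, and the term sets are placeholders invented to exercise the code, not results.

```python
from itertools import combinations, product


def parameter_grid(token_limits, concept_counts):
    """Every (vocabulary size, concept count) combination to try."""
    return list(product(token_limits, concept_counts))


def jaccard(a, b):
    """Overlap of two term sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(top_terms):
    """Compare the leading-concept terms produced under each setting.

    top_terms maps a (num_tokens, num_concepts) setting to the terms with
    the largest coefficients in that model's first concept.
    """
    return {
        (s1, s2): jaccard(top_terms[s1], top_terms[s2])
        for s1, s2 in combinations(sorted(top_terms), 2)
    }


# Hypothetical grid; only its two corner settings appear in the poster.
grid = parameter_grid([10000, 20000, 30000], [50, 200, 400])

# Placeholder term sets, invented purely to exercise the code.
example_top_terms = {
    (10000, 50): {"rat", "mouse", "human"},
    (30000, 400): {"compound", "inhibitor", "human"},
}
overlap = pairwise_overlap(example_top_terms)
```

A low overlap between two settings would flag the kind of shift in primary topic reported in the Results; the choice of Jaccard overlap as the measure is itself an assumption.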
System User Interface
Figure 1: Network Diagram for the 10000 token/50 concept model.
Figure 2: Narrative for the 10000 token/50 concept model.
Figure 3: Network Diagram for the 30000 token/400 concept model.
Figure 4: Narrative for the 30000 token/400 concept model.
Figure 5: NLP Test System User Interface
Legend for the network diagrams: negative coefficient, positive coefficient.