Experimental Design
Abstract
Large amounts of bioassay data are being collected in public
repositories such as PubChem; however, standardized methods for
entering and storing the data and, more importantly, for promoting
automated grouping of bioassay topics are lacking. The main objective of
this research is to create a tool that 1) lets a chemical domain
expert manually review and approve narrative tokens as they
are generated, and 2) permits automated grouping of bioassays based on
approved token sets. Grouping bioassays in a semi-supervised
manner will let chemists extract more relevant compound-assay data
sets for Quantitative Structure Toxicity Relationship (QSTR) modeling in the
future.
Data
PubMed is a publicly accessible document database developed and
maintained by the National Center for Biotechnology Information (NCBI).
PubMed hosts information related to biomedicine and health, life sciences,
behavioral sciences, chemical sciences, and bioengineering, and is the
source of data for our project. The PubMed data used for this project
consisted of publication titles and abstracts. Each abstract described the
biological assays and the experimental parameters that generated the
results. These experimental parameters included species, small molecules,
and endpoints. For this experiment we used 2162 randomly selected bioassay
narratives.
Methods
Tokenization – Tokenization is the process by which a stream of text is
broken into word elements called tokens. Tokenization was used to
parse bioassay narratives into lists of meaningful experimental
parameters including, but not limited to, small molecules, compounds, and
endpoints. Regular expressions were used to match tokens or
separators within the narratives. One such pattern is:
RegexpTokenizer = ('\S*\[[^]]*\]\S*|\S*\([^)]*\)\S*|\S+')
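The behaviour of this pattern can be sketched as follows, assuming it is read with `\S` for non-whitespace and with escaped brackets and parentheses. The sketch uses `re.findall` from the Python standard library, which returns the same left-to-right matches a match-based regular-expression tokenizer would, rather than NLTK itself. The function name and both sample narratives are invented for illustration and are not taken from the project data.

```python
import re

# Assumed reading of the tokenizer pattern: bracketed or parenthesized spans
# (together with any characters glued to them) are kept whole; everything
# else is split on whitespace.
TOKEN_PATTERN = re.compile(r"\S*\[[^]]*\]\S*|\S*\([^)]*\)\S*|\S+")


def tokenize(narrative):
    """Return the non-overlapping pattern matches, left to right."""
    return TOKEN_PATTERN.findall(narrative)


# Invented sample narratives.
print(tokenize("5-(4-chlorophenyl)pyrazole inhibited growth"))
print(tokenize("ketoconazole [IC50 = 15 nM] in CYP3A4 (human liver microsomes)"))
```

The first two alternatives are what keep a chemical name with embedded parentheses, or a bracketed measurement containing spaces, together as one token; plain whitespace splitting would break both apart.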
Dictionary Referencing – A domain-specific dictionary was built from
ChEMBL, the BioAssay Ontology (BAO) and MeSH. BAO includes concepts
and relationships relevant to biological assays, ChEMBL is a chemical
database of bioactive molecules with drug-like properties, and MeSH
(Medical Subject Headings) is a controlled-vocabulary thesaurus used
across biomedical research. These
controlled vocabularies were used both to assist the chemist in
reviewing tokens and to permit automated validation of token content.
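A minimal sketch of how such dictionary-based validation might work is given below. The dictionary entries, their source assignments, the normalization rule and the function names are all hypothetical placeholders; a real dictionary would be loaded from exports of the three vocabularies, and the poster does not describe the actual lookup logic.

```python
# Hypothetical miniature dictionary mapping a normalized term to the
# vocabulary it came from. The assignments are illustrative only and were
# not checked against the actual vocabularies.
DOMAIN_DICTIONARY = {
    "ketoconazole": "ChEMBL",
    "ic50": "BAO",
    "rats": "MeSH",
}


def normalize(token):
    """Strip surrounding punctuation and lowercase the token."""
    return token.strip("()[].,;:").lower()


def validate_tokens(tokens):
    """Split tokens into dictionary-validated ones and ones needing review."""
    approved, needs_review = {}, []
    for token in tokens:
        source = DOMAIN_DICTIONARY.get(normalize(token))
        if source is None:
            needs_review.append(token)
        else:
            approved[token] = source
    return approved, needs_review


approved, needs_review = validate_tokens(["Ketoconazole", "[IC50]", "inhibited", "rats."])
print(approved)
print(needs_review)
```

Tokens that land in the review list are the ones a chemist would be asked to approve or reject by hand, which is the human-assisted half of the workflow.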
Latent Semantic Indexing (LSI) – LSI projects documents from a
high-dimensional term space into a lower-dimensional concept space using
singular-value decomposition (SVD). In the reduced space, semantically
related documents lie close together, so LSI can identify the concepts
contained in a text even when different words are used in similar
contexts.
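The projection can be illustrated on a toy corpus. The sketch below is not the project's implementation: it approximates the leading singular directions by power iteration on the Gram matrix, a standard-library stand-in for a library SVD routine, and the five "narratives", the function names and the choice of two concepts are all invented for illustration.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are vocabulary terms, columns are documents, entries are raw counts."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    return vocab, [[c[term] for c in counts] for term in vocab]


def leading_concepts(matrix, k, iterations=500):
    """Approximate the k largest singular values and right singular vectors.

    Power iteration with deflation on the Gram matrix A^T A; a stdlib-only
    stand-in for a library SVD routine.
    """
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    concepts = []
    for _ in range(k):
        vec = [1.0] * n
        for _ in range(iterations):
            nxt = [sum(g * v for g, v in zip(gram_row, vec)) for gram_row in gram]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        eigenvalue = sum(vec[i] * gram[i][j] * vec[j] for i in range(n) for j in range(n))
        concepts.append((math.sqrt(max(eigenvalue, 0.0)), vec))
        # Deflate: remove the direction just found so the next pass finds the next one.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return concepts


def document_coordinates(concepts, n_docs):
    """Coordinates of each document in concept space (singular value times vector entry)."""
    return [[sigma * vec[j] for sigma, vec in concepts] for j in range(n_docs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented mini-corpus standing in for tokenized bioassay narratives.
docs = [
    "rat liver toxicity",
    "mouse liver toxicity",
    "rat kidney toxicity",
    "kinase inhibitor binding",
    "kinase inhibitor potency",
]
vocab, matrix = term_document_matrix(docs)
coords = document_coordinates(leading_concepts(matrix, k=2), len(docs))
print(cosine(coords[1], coords[2]))  # share only "toxicity" in term space
print(cosine(coords[0], coords[3]))  # share no terms at all
```

In term space the second and third narratives share a single word, yet in the two-concept space they point in essentially the same direction, while the toxicity and kinase narratives remain orthogonal.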
Results
LSI captured high-level topics and contextual summarization of bioassays.
This was accomplished with no supervised class training and substantiated by
reviewing the narratives matched to each extracted topic. However,
LSI behaved differently under different parameter sets. The 10000 token/50
concept model focused on species as a primary topic whereas the 30000
token/400 concept model focused on chemical compounds as the primary
topic. This sensitivity to training parameters, and thus different concept
mappings, was consistently observed across nine different LSI parameter sets.
Conclusion
A software system was developed to demonstrate NLP's potential with
bioassay datasets. The test system allowed a chemist to perform
human-assisted tokenization of chemical terminology followed by
automated concept mapping. Our NLP results were encouraging and
suggest that automated parsing and stratification of complex bioassay
descriptions is feasible. However, additional empirical
studies are required to evaluate how sensitive topic extraction is to
LSI parameters, and thus how well this technique can be controlled
in a chemical modeling environment.
Acknowledgments
This material is based upon work supported by IOMICS Corporation.
Special thanks to Joe Gormley for Technical Project Management. Special
thanks to Tom Zisk for Software Engineering support.
References
• BioAssay Ontology: Mader C, Datar N, Abeyruwan S, Koleti A, Venkatapuram S, Chung C, Puram D, Vempati U,
Sakurai K, Przydzial M, Lemmon V, Visser U, Schurer S. http://baosearch.ccs.miami.edu/baosearch/
• Latent Semantic Indexing: https://en.wikipedia.org/wiki/Latent_semantic_indexing
• NLTK: http://www.nltk.org/
• PubChem: Wang Y, Suzek T, Zhang J, Wang J, He S, Cheng T, Shoemaker BA, Gindulyte A, Bryant SH. PubChem
BioAssay: 2014 update. Nucleic Acids Res. 2014 Jan 1;42(1):D1075-82. Epub 2013 Nov 5 [PubMed PMID:
24198245] https://pubchem.ncbi.nlm.nih.gov
• PubMed Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2005-. PubMed Help.
[Updated 2015 Aug 7]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK3827/
Bioassay Concept Mapping: Limitation of Current NLP Technologies
Mohammed Ayub, Bir Kafle, Hai Lu, Suman Lama, Nikita Mutta
Advisor: Professor Fatemeh Emdad
Future Direction
Future improvements may include:
• Automated exploration of LSI experimental parameters to determine
their effect on, and the sensitivity of, concept extraction.
• Deeper integration of BAO and ChEMBL to support automated concept
categorization and validation.
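One way the first item might be approached is sketched below: enumerate a grid of LSI settings and score how much the dominant topic's terms overlap between settings. Everything here is an assumption made for illustration. Only the 10000/50 and 30000/400 settings appear on the poster; the intermediate grid values, the Jaccard overlap measure, the function names and the sample term lists are hypothetical placeholders, not project results.

```python
from itertools import product


def parameter_grid(vocab_sizes, concept_counts):
    """All (vocabulary size, number of concepts) combinations to train."""
    return list(product(vocab_sizes, concept_counts))


def jaccard(a, b):
    """Overlap of two term collections: 1.0 is identical, 0.0 is disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def pairwise_stability(top_terms_by_setting):
    """Jaccard overlap of dominant-topic terms for every pair of settings."""
    settings = sorted(top_terms_by_setting)
    return {
        (s, t): jaccard(top_terms_by_setting[s], top_terms_by_setting[t])
        for i, s in enumerate(settings)
        for t in settings[i + 1:]
    }


# Hypothetical 3 x 3 grid; the middle values are placeholders.
grid = parameter_grid([10000, 20000, 30000], [50, 200, 400])
print(len(grid))

# Invented top terms for two settings, standing in for trained-model output.
top_terms = {
    (10000, 50): ["rat", "mouse", "human"],
    (30000, 400): ["ketoconazole", "kinase", "human"],
}
print(pairwise_stability(top_terms))
```

Low overlap between two settings would flag the kind of topic shift described in the Results, where one model emphasized species and another emphasized chemical compounds.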
System User Interface
Figure 1: Network Diagram for the 10000 token/50 concept model.
Figure 2: Narrative for the 10000 token/50 concept model.
Figure 3: Network Diagram for the 30000 token/400 concept model.
Figure 4: Narrative for the 30000 token/400 concept model.
Figure 5: NLP Test System User Interface
Figure legend: negative coefficient, positive coefficient.
