A Natural Language Processing Approach to Reviewing Research Abstracts
(Japanese title: Literature Review Using Natural Language Processing)
INTERNATIONAL COLLEGE OF TECHNOLOGY, KANAZAWA
EDUCATION OUTGROWTH SYMPOSIUM AY2020
PRESENTED BY ROBERT SONGER
Background
Literature Reviews
◦ Necessary for research, such as capstone projects (graduation research)
◦ Involve searching thousands of papers for relevant studies
◦ Hard to know when to stop searching
Searching Online Research Databases
◦ Relies on basic search engines to find relevant papers
◦ Complicated searches often yield too few or too many results
◦ Context is often lost when searching for subjects
Background
Joint research with Kanazawa University Practical Pharmacology Laboratory
Search for medical research on chemical compounds using Natural Language Processing (NLP)
Research topic:
"Adverse effects of cosmetic ingredients on human skin"
Hypothesis:
"A valid search technique will yield more relevant results for toxic compounds than it will for non-toxic compounds."
Method
Data Mining & Searching
• Download a list of cosmetic ingredient compounds
• Search national databases for data on each compound
• Eliminate compounds without therapeutic uses
• Split the list into toxic and non-toxic compounds
• Search for abstracts about each compound's adverse effects on skin
NLP of Each Abstract
• Tokenize the words and sentences in the text
• Compute word frequencies for the top 5 words, the compound name, and skin-related words
• Calculate in-sentence collocations for contextual relevance
Data Analysis
• Eliminate duplicates and abstracts that never mention the compound name or skin
• Sort by the collocation ratio of compound name and skin-related words
• Examine the top 5 words of each abstract and decide whether to read it in full
Preparing the Data
Started with a list of ingredients for cosmetic products (CosIng) totaling 28,004 compounds [1]
With this list:
◦ Queried the PubChem database for each compound
◦ Removed any CosIng compound with no therapeutic uses
◦ Made a list of toxic compounds
◦ Made a list of non-toxic compounds
For each compound:
◦ Queried the PubMed database for abstracts
◦ Downloaded the abstract text for NLP
[1] Downloaded from the EU Open Data Portal at https://data.europa.eu/euodp/en/data/dataset/cosmetic-ingredient-database-ingredients-and-fragrance-inventory
PubChem Database [2]
Data for each compound includes:
◦ Therapeutic uses
◦ Toxicity
Queried the PubChem "PUG View" web service for each compound and processed the XML results into CSV files:
◦ Toxic w/ therapeutic uses
◦ Non-toxic w/ therapeutic uses
[2] PubChem, National Library of Medicine, National Center for Biotechnology Information, https://pubchem.ncbi.nlm.nih.gov/
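The PUG View request described above can be sketched as a URL builder. This is only an illustration, not the project's code: the function name, the example compound ID, and the exact heading string are assumptions, and the real headings should be checked against the PubChem documentation before use.

```python
from urllib.parse import urlencode

# Base address of the PubChem PUG View REST service for compound records.
PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def pug_view_url(cid, heading, fmt="XML"):
    """Build a PUG View URL for one compound ID (CID) and one record heading.

    `heading` narrows the response to a single section of the record;
    the heading text used below is an assumed example.
    """
    query = urlencode({"heading": heading})
    return f"{PUG_VIEW_BASE}/{cid}/{fmt}?{query}"


# Illustrative compound ID; any CID taken from the ingredient list would do.
url = pug_view_url(887, "Therapeutic Uses")
print(url)
```

Building the URL separately from fetching it keeps the network step easy to throttle or retry, which matters when thousands of compounds are queried in a row.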
PubMed Database [3]
Built and ran a program that queries the PubMed "Entrez" E-Utilities for research abstracts and obtains their IDs
Query URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=&term="methanol""adverse effect""skin"
Downloaded each abstract found
Abstract URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=PubMed&retmode=xml&ID=17613130
[3] PubMed, National Library of Medicine, National Center for Biotechnology Information, https://pubmed.ncbi.nlm.nih.gov/
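A minimal sketch of the two-step PubMed lookup (search for IDs, then fetch each abstract) might look like the following. The helper names, the explicit db=pubmed value, and joining the quoted terms with AND are assumptions made for illustration; the sample XML is a hand-written stand-in for an ESearch response, so nothing here touches the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(compound, db="pubmed"):
    """URL that searches for abstracts on a compound's adverse skin effects."""
    # Assumption: the three quoted phrases are combined with AND.
    term = f'"{compound}" AND "adverse effect" AND "skin"'
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode({'db': db, 'term': term})}"


def efetch_url(pmid, db="pubmed"):
    """URL that downloads one record, including its abstract, as XML."""
    params = {"db": db, "retmode": "xml", "id": pmid}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def parse_ids(esearch_xml):
    """Extract the list of record IDs from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_el.text for id_el in root.findall("./IdList/Id")]


# Hand-written stand-in for a real ESearch response.
SAMPLE = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>17613130</Id></IdList></eSearchResult>"
)
ids = parse_ids(SAMPLE)
fetch_urls = [efetch_url(pmid) for pmid in ids]
```

Letting urlencode handle the quoting avoids hand-escaping the double quotes and spaces in the search term.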
Natural Language Processing
Natural Language Toolkit (NLTK) [4] is a Python library for basic NLP functions:
Tokenization: splitting natural language into smaller units for easier processing
◦ We split the abstract text into word tokens and sentence tokens
Frequency distributions: counting how often each word occurs in the text
◦ After removing stop words ("a", "the", "it", etc.), we took the 5 most common words in each abstract
Collocations: multiple words appearing together in the text, such as "technical college"
◦ We expanded the scope to whole sentences, checking for compound names and skin words appearing together
[4] Natural Language Toolkit, https://www.nltk.org/
Tokenizing by Words and Sentences
    import nltk

    # Prepare a pre-trained model for detecting and tokenizing sentences
    nltk.download('punkt')
    sentence_detector = nltk.data.load("tokenizers/punkt/english.pickle")

    def tokenizeAbstract(abstractText):
        abstractTokens = []
        # Tokenize all words in lowercase
        words = nltk.word_tokenize(abstractText)
        words = [w.lower() for w in words]
        # Remove whitespace around the text and tokenize sentences
        sentences = sentence_detector.tokenize(abstractText.strip())
        # Return the sentence and word lists as one (sentences, words) pair
        abstractTokens.append((sentences, words))
        return abstractTokens
Top 5 Words Frequency Distribution
    import nltk
    from nltk.probability import FreqDist

    # Prepare English stop words with punctuation added
    nltk.download('stopwords')
    stopwords = nltk.corpus.stopwords.words("english")
    mystopwords = [".", ",", "(", ")", "%", ";", ":"]
    stopwords.extend(mystopwords)

    def getTopFiveWords(words):
        # Filter out stop words and create the frequency distribution
        filtered_words = [w for w in words if w not in stopwords]
        freqDist = FreqDist(filtered_words)
        # Return a list of the top 5 (word, count) pairs
        return freqDist.most_common(5)
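To see the counting step on a tiny input without downloading any NLTK data, the same idea can be sketched with the standard library. Note the swap: this uses collections.Counter in place of NLTK's FreqDist, and the stop-word set and sample sentence are made up for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set (the slide uses NLTK's list).
STOP_WORDS = {"a", "the", "it", "of", "on", "and", "was", ".", ","}


def top_words(words, n=5):
    """Count the words that are not stop words and return the n most common."""
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Pre-tokenized, lowercased sample text.
tokens = "the skin irritation of methanol on the skin was mild .".split()
result = top_words(tokens, n=2)
print(result)  # the most frequent word here is "skin", with a count of 2
```

A reviewer scanning these top words can often tell at a glance whether an abstract is about the compound's effect on skin or about something else entirely.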
Calculating Sentence Collocation Ratios
    def getSkinAndCompoundNameCollocations(compound_name, sentences):
        # Guard against an empty sentence list to avoid dividing by zero
        if not sentences:
            return 0.0
        collocations = 0
        # Skin-related words to search for in each sentence
        skin_words = ["skin", "dermal", "epidermis"]
        compound = compound_name.lower()
        for sentence in sentences:
            sentence = sentence.lower()
            # Search each sentence for the skin-related words
            skin_found = any(w in sentence for w in skin_words)
            # Count the sentence if it also has the compound name
            if skin_found and compound in sentence:
                collocations += 1
        # Return the ratio of collocation sentences to total sentences
        return collocations / len(sentences)
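A short worked example may make the ratio concrete. The function below is a compact restatement of the slide's logic under a hypothetical name, and the four sentences are invented; only the first one mentions both the compound and a skin-related word.

```python
SKIN_WORDS = ("skin", "dermal", "epidermis")


def collocation_ratio(compound_name, sentences):
    """Fraction of sentences containing both the compound and a skin word."""
    if not sentences:
        return 0.0
    compound = compound_name.lower()
    hits = 0
    for sentence in sentences:
        lowered = sentence.lower()
        has_skin_word = any(w in lowered for w in SKIN_WORDS)
        if has_skin_word and compound in lowered:
            hits += 1
    return hits / len(sentences)


# Invented sample abstract, already split into sentences.
sentences = [
    "Methanol exposure caused dermal irritation in volunteers.",
    "Skin recovery was complete after one week.",
    "Methanol levels in blood were also measured.",
    "No other effects were observed.",
]
ratio = collocation_ratio("methanol", sentences)
print(ratio)  # 1 of 4 sentences qualifies, so 0.25
```

Because the check is a plain substring match, a word such as "skinny" would also count as a skin word; matching on word tokens instead would be stricter.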
Analyzing the Data
CSV results were analyzed in Microsoft Excel:
◦ Found duplicate abstracts that appeared for more than one compound
◦ Filtered out abstracts with no compound name, skin-related words, or collocations
◦ Chose abstracts to review manually based on the relevance of their top 5 words

                            Toxic Compounds   Non-Toxic Compounds   Total
Total abstracts                         287                   261     548
Unique abstracts                        221                    80     301
Abstracts w/ collocations                49                     2      51
Relevant abstracts                       16                     0      16
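The analysis itself was done in Excel, but the same filtering steps could be sketched in Python as follows. The record fields (pmid, compound, ratio) and the rule of keeping the highest-ratio copy of a duplicated abstract are assumptions for illustration, not details taken from the project.

```python
def rank_abstracts(records):
    """Drop duplicate abstracts and zero-ratio rows, then sort by ratio."""
    best = {}
    for rec in records:
        kept = best.get(rec["pmid"])
        # Assumption: when an abstract appears for several compounds,
        # keep the row with the highest collocation ratio.
        if kept is None or rec["ratio"] > kept["ratio"]:
            best[rec["pmid"]] = rec
    with_hits = [r for r in best.values() if r["ratio"] > 0]
    return sorted(with_hits, key=lambda r: r["ratio"], reverse=True)


# Invented rows standing in for the CSV output of the NLP step.
records = [
    {"pmid": "A", "compound": "x", "ratio": 0.10},
    {"pmid": "A", "compound": "y", "ratio": 0.25},
    {"pmid": "B", "compound": "x", "ratio": 0.0},
    {"pmid": "C", "compound": "z", "ratio": 0.50},
]
ranked = rank_abstracts(records)
print([r["pmid"] for r in ranked])  # ['C', 'A']
```

The final step, judging relevance from each abstract's top 5 words, stays a manual decision and is not modeled here.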
Conclusion
☆ Our method searched for 28,004 cosmetic ingredients, analyzed 301 unique abstracts, and found 16 relevant studies about adverse effects on human skin.
Zero relevant results came from non-toxic compounds, consistent with the hypothesis.
Querying the PubChem and PubMed databases took 15-30 minutes due to network restrictions.
NLP on all abstract texts completed in less than a second.
Using NLP can greatly reduce the overhead of literature reviews for complicated subjects.
Thank You!
Source code available on GitHub:
https://github.com/rsonger/CosIng-Toxicity