This document describes a natural language processing approach to reviewing medical research abstracts about cosmetic ingredients and their adverse effects on human skin. The approach searched the PubChem and PubMed databases for toxicity data and research abstracts on 28,004 cosmetic compounds. Natural language processing techniques (tokenization, word frequency analysis, and in-sentence collocation detection) were used to analyze 301 unique abstracts. The method identified 16 relevant studies on adverse skin effects, all of them for toxic compounds and none for non-toxic compounds. Querying the databases took 15-30 minutes, and the NLP step on all abstract texts completed in less than a second.
A Natural Language Processing Approach to Reviewing Research Abstracts
1. A Natural Language Processing Approach to Reviewing Research Abstracts
Literature Review Using Natural Language Processing
INTERNATIONAL COLLEGE OF TECHNOLOGY, KANAZAWA
EDUCATION OUTGROWTH SYMPOSIUM AY2020
PRESENTED BY ROBERT SONGER
3. Background
Joint research with Kanazawa University Practical Pharmacology Laboratory
Search for medical research on chemical compounds using Natural Language Processing (NLP)
Research topic:
"Adverse effects of cosmetic ingredients on human skin"
Hypothesis:
"A valid search technique will yield more relevant results for toxic compounds than it will for non-toxic compounds."
4. Method
Data Mining & Searching
• Download a list of cosmetic ingredient compounds
• Search national databases for data on each compound
• Eliminate compounds without therapeutic uses
• Split the list into toxic and non-toxic compounds
• Search for abstracts about adverse effects on skin by each compound
NLP of Each Abstract
• Tokenize words and sentences in the text
• Process word frequencies for the top 5 words, compound name, and skin-related words
• Calculate in-sentence collocations for contextual relevance
Data Analysis
• Eliminate duplicates and abstracts with no mentions of compound name or skin
• Sort by collocation ratio of compound name and skin-related words
• Examine the top 5 words of each abstract and decide whether to read it in full
5. Preparing the Data
Started with a list of ingredients for cosmetic products (CosIng) totaling 28,004 compounds¹
With this list:
◦ Queried the PubChem database for compounds
◦ Removed any CosIng compounds with no therapeutic uses
◦ Made a list of toxic compounds
◦ Made a list of non-toxic compounds
For each compound:
◦ Queried the PubMed database for abstracts
◦ Downloaded abstract text to do NLP
¹ Downloaded from the EU Open Data Portal at https://data.europa.eu/euodp/en/data/dataset/cosmetic-ingredient-database-ingredients-and-fragrance-inventory
6. PubChem Database²
Data for each compound includes:
◦ Therapeutic uses
◦ Toxicity
Queried the PubChem "PUG View" web service for each compound and processed the XML results into CSV files:
◦ Toxic w/ therapeutic uses
◦ Non-toxic w/ therapeutic uses
² PubChem, National Library of Medicine, National Center for Biotechnology Information, https://pubchem.ncbi.nlm.nih.gov/
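The PUG View step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the presenter's program: the helper names (`pug_view_url`, `section_headings`, `classify`, `to_csv`) are hypothetical, the sample record is a simplified stand-in for a real download (real records are namespaced and far more deeply nested), and the rule that a "Toxicity" heading marks a compound as toxic is assumed only for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/XML"

def pug_view_url(cid):
    """Build the PUG View URL for one PubChem compound ID (CID)."""
    return PUG_VIEW.format(cid=cid)

def section_headings(xml_text):
    """Collect the text of every TOCHeading element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    headings = set()
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip "{namespace}" if present
        if local_name == "TOCHeading" and elem.text:
            headings.add(elem.text.strip())
    return headings

def classify(headings):
    """Assumed rule: drop compounds without therapeutic uses, then split on toxicity."""
    if "Therapeutic Uses" not in headings:
        return None
    return "toxic" if "Toxicity" in headings else "non-toxic"

def to_csv(rows):
    """Write (compound, label) rows to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["compound", "label"])
    writer.writerows(rows)
    return out.getvalue()

# Simplified, hypothetical record used in place of a live download.
SAMPLE_XML = """<Record>
  <Section><TOCHeading>Drug and Medication Information</TOCHeading>
    <Section><TOCHeading>Therapeutic Uses</TOCHeading></Section>
  </Section>
  <Section><TOCHeading>Toxicity</TOCHeading></Section>
</Record>"""

label = classify(section_headings(SAMPLE_XML))
csv_text = to_csv([("methanol", label)])
print(csv_text)
```

Scanning for headings by local element name keeps the sketch independent of the exact nesting, at the cost of ignoring where in the record a heading appears.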
7. PubMed Database³
Built and ran a program to query the PubMed "Entrez" E-Utilities for research abstracts and obtain their IDs
Query URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=&term="methanol""adverse effect""skin"
Downloaded each abstract found
Abstract URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=PubMed&retmode=xml&ID=17613130
³ PubMed, National Library of Medicine, National Center for Biotechnology Information, https://pubmed.ncbi.nlm.nih.gov/
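The two E-Utilities calls could look something like the sketch below. It only builds the URLs and parses responses, so it runs without network access. The function names are hypothetical, joining the search terms with an explicit AND is an assumption about the query, and both XML samples are trimmed placeholders rather than real PubMed records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(compound, extra_terms=("adverse effect", "skin")):
    """Build an ESearch URL for the compound combined with the extra terms."""
    term = " AND ".join(f'"{t}"' for t in (compound, *extra_terms))
    return EUTILS + "esearch.fcgi?" + urlencode({"db": "pubmed", "term": term})

def efetch_url(pmid):
    """Build an EFetch URL that returns one PubMed record as XML."""
    return EUTILS + "efetch.fcgi?" + urlencode(
        {"db": "pubmed", "retmode": "xml", "id": pmid}
    )

def parse_ids(esearch_xml):
    """Extract the PubMed IDs listed in an ESearch response."""
    root = ET.fromstring(esearch_xml)
    return [e.text for e in root.findall("./IdList/Id")]

def parse_abstract(efetch_xml):
    """Join all AbstractText sections of an EFetch response into one string."""
    root = ET.fromstring(efetch_xml)
    parts = ("".join(e.itertext()).strip() for e in root.iter("AbstractText"))
    return " ".join(p for p in parts if p)

# Trimmed placeholder responses (not real PubMed output).
SAMPLE_SEARCH = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>17613130</Id></IdList></eSearchResult>"
)
SAMPLE_FETCH = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation><Article><Abstract>"
    "<AbstractText>Placeholder abstract text.</AbstractText>"
    "</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)

ids = parse_ids(SAMPLE_SEARCH)
abstract_text = parse_abstract(SAMPLE_FETCH)
print(esearch_url("methanol"))
print(ids, abstract_text)
```

Using `urlencode` percent-encodes the quotation marks and spaces in the search term, which avoids malformed URLs when compound names contain special characters.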
8. Natural Language Processing
Natural Language Toolkit (NLTK)⁴ is a Python library for basic NLP functions:
Tokenization: splitting natural language into smaller units for easier processing
◦ We split abstract text into word tokens and sentence tokens
Frequency distributions: counting how often each word appears in the text
◦ After removing stop words ("a", "the", "it", etc.), we got the 5 most common words in each abstract
Collocations: multiple words appearing together in the text, such as "technical college"
◦ We expanded the scope to check entire sentences for compound names and skin words appearing together
⁴ Natural Language Toolkit, https://www.nltk.org/
9. Tokenizing by Words and Sentences

    import nltk

    # Prepare a pre-trained model for detecting and tokenizing sentences
    nltk.download('punkt')
    sentence_detector = nltk.data.load("tokenizers/punkt/english.pickle")

    def tokenizeAbstract(abstractText):
        abstractTokens = []
        # Tokenize all words in lowercase
        words = nltk.word_tokenize(abstractText)
        words = [w.lower() for w in words]
        # Remove whitespace around the text and tokenize sentences
        sentences = sentence_detector.tokenize(abstractText.strip())
        # Return sentence and word lists
        abstractTokens.append((sentences, words))
        return abstractTokens
10. Top 5 Words Frequency Distribution

    import nltk
    from nltk.probability import FreqDist

    # Prepare English stop words with punctuation added
    nltk.download('stopwords')
    stopwords = nltk.corpus.stopwords.words("english")
    mystopwords = [".", ",", "(", ")", "%", ";", ":"]
    stopwords.extend(mystopwords)

    def getTopFiveWords(words):
        # Filter out stop words and create the Frequency Distribution
        filtered_words = []
        for w in words:
            if w not in stopwords:
                filtered_words.append(w)
        freqDist = FreqDist(filtered_words)
        # Return list of top 5 words
        return freqDist.most_common(5)
11. Calculating Sentence Collocation Ratios

    def getSkinAndCompoundNameCollocations(compound_name, sentences):
        # Avoid dividing by zero when an abstract has no sentences
        if not sentences:
            return 0.0
        collocations = 0
        # Skin-related words to search for in each sentence
        skin_words = ["skin", "dermal", "epidermis"]
        for sentence in sentences:
            sentence = sentence.lower()
            # Search each sentence for the skin-related words
            skin_found = False
            for w in skin_words:
                if sentence.find(w) > -1:
                    skin_found = True
                    break
            # Count the sentence if it also has compound name
            if skin_found and sentence.find(compound_name.lower()) > -1:
                collocations += 1
        # Return a ratio of collocation sentences to total sentences
        return collocations / len(sentences)
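As a worked illustration of the ratio, the sketch below scores a made-up three-sentence abstract. To stay self-contained it splits sentences with a simple regular expression rather than NLTK's pre-trained sentence model, so its sentence boundaries may differ on real text. The sample abstract and the names `split_sentences` and `collocation_ratio` are assumptions for the example only.

```python
import re

SKIN_WORDS = ("skin", "dermal", "epidermis")

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def collocation_ratio(compound_name, sentences):
    """Share of sentences that mention both the compound and a skin-related word."""
    if not sentences:
        return 0.0
    name = compound_name.lower()
    hits = sum(
        1
        for s in (sentence.lower() for sentence in sentences)
        if name in s and any(w in s for w in SKIN_WORDS)
    )
    return hits / len(sentences)

# Made-up abstract text, for illustration only.
abstract = (
    "Methanol was applied to the skin of volunteers. "
    "Blood samples were collected after two hours. "
    "Dermal absorption of methanol was measured."
)
sentences = split_sentences(abstract)
ratio = collocation_ratio("methanol", sentences)
print(f"{len(sentences)} sentences, ratio = {ratio:.2f}")  # 3 sentences, ratio = 0.67
```

Two of the three sentences contain both "methanol" and a skin-related word, so the ratio is 2/3. A higher ratio suggests the abstract discusses the compound in a skin context rather than mentioning the two separately.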
12. Analyzing the Data
CSV results were analyzed in Microsoft Excel:
◦ Found duplicate abstracts that appeared for more than one compound
◦ Filtered out abstracts with no compound name, skin-related words, or collocations
◦ Chose abstracts to manually review based on the relevance of top 5 words
                             Toxic Compounds   Non-Toxic Compounds   Total
Total abstracts                          287                   261     548
Unique abstracts                         221                    80     301
Abstracts w/ collocations                 49                     2      51
Relevant abstracts                        16                     0      16
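The analysis itself was done in Excel, but the same filtering steps could also be scripted. The sketch below is one hypothetical way to do that in Python. The record fields (`pmid`, `compound`, `ratio`, `top5`), the sample rows, and the choice to keep the best-scoring row for a duplicated abstract are all invented for illustration and do not come from the study's CSV files.

```python
def select_for_review(records):
    """Deduplicate by PubMed ID, drop zero-ratio abstracts, sort by ratio (highest first)."""
    best = {}
    for rec in records:
        kept = best.get(rec["pmid"])
        # An abstract can be returned for several compounds; keep its best-scoring row.
        if kept is None or rec["ratio"] > kept["ratio"]:
            best[rec["pmid"]] = rec
    with_collocations = [r for r in best.values() if r["ratio"] > 0]
    return sorted(with_collocations, key=lambda r: r["ratio"], reverse=True)

# Invented sample rows standing in for the per-abstract CSV results.
records = [
    {"pmid": "100", "compound": "compound A", "ratio": 0.25, "top5": ["skin", "exposure"]},
    {"pmid": "100", "compound": "compound B", "ratio": 0.10, "top5": ["skin", "exposure"]},
    {"pmid": "200", "compound": "compound A", "ratio": 0.00, "top5": ["liver", "dose"]},
    {"pmid": "300", "compound": "compound C", "ratio": 0.50, "top5": ["dermal", "irritation"]},
]

shortlist = select_for_review(records)
for rec in shortlist:
    print(rec["pmid"], rec["compound"], rec["ratio"], rec["top5"])
```

The final decision about which abstracts to read in full still rests on a person looking at the top 5 words, so the script only produces the ranked shortlist.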
13. Conclusion
Our method searched for 28,004 cosmetic ingredients, analyzed 301 unique abstracts, and found 16 relevant studies about adverse effects on human skin.
Zero relevant results came from non-toxic compounds.
Querying the PubChem and PubMed databases took 15-30 minutes due to network restrictions.
NLP on all abstract texts completed in less than a second.
Using NLP can greatly reduce the overhead of literature reviews for complicated subjects.
14. Thank You!
Source code available on GitHub:
https://github.com/rsonger/CosIng-Toxicity