A Natural Language Processing Approach to Reviewing Research Abstracts

Literature Review Using Natural Language Processing
International College of Technology, Kanazawa
Education Outgrowth Symposium AY2020
Presented by Robert Songer

Background
Literature Reviews
◦ Necessary for research, such as capstone projects (graduation research)
◦ Include searching thousands of papers for relevant studies
◦ Difficult to know when to stop searching
Searching Online Research Databases
◦ Relies on basic search engines to find relevant papers
◦ Complicated searches often yield too few or too many results
◦ Context often lost when searching for subjects

Background
Joint research with Kanazawa University Practical Pharmacology Laboratory
Search for medical research on chemical compounds using Natural Language Processing (NLP)
Research topic:
"Adverse effects of cosmetic ingredients on human skin"
Hypothesis:
"A valid search technique will yield more relevant results for toxic compounds than it will for non-toxic compounds."

Method
Data Mining & Searching
• Download a list of cosmetic ingredient compounds
• Search national databases for data on each compound
• Eliminate compounds without therapeutic uses
• Split list into toxic and non-toxic compounds
• Search for abstracts about adverse effects of each compound on skin
NLP of Each Abstract
• Tokenize words and sentences in the text
• Process word frequencies for top 5 words, compound name, and skin-related words
• Calculate in-sentence collocations for contextual relevance
Data Analysis
• Eliminate duplicates and abstracts with no mentions of compound name or skin
• Sort by collocation ratio of compound name and skin-related words
• Examine the top 5 words of each abstract and decide whether to read it in full

Preparing the Data
Started with a list of ingredients for cosmetic products (CosIng) totaling 28,004 compounds¹
With this list:
◦ Queried PubChem database for compounds
◦ Removed any CosIng entries with no therapeutic uses
◦ Made a list of toxic compounds
◦ Made a list of non-toxic compounds
For each compound:
◦ Queried PubMed database for abstracts
◦ Downloaded abstract text to do NLP
1 Downloaded from the EU Open Data Portal at
https://data.europa.eu/euodp/en/data/dataset/cosmetic-ingredient-database-ingredients-and-fragrance-inventory

PubChem Database²
Data for each compound includes:
◦ Therapeutic uses
◦ Toxicity
Queried the PubChem "PUG View" web service for each compound and processed the XML results into CSV files:
◦ Toxic w/ therapeutic uses
◦ Non-toxic w/ therapeutic uses
2 PubChem, National Library of Medicine, National Center for Biotechnology Information, https://pubchem.ncbi.nlm.nih.gov/
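
The filtering and splitting step above could look roughly like the sketch below. It is an illustration only, not the project's code: the record fields (`name`, `therapeutic_uses`, `toxicity`) and the helper names `split_by_toxicity` and `to_csv` are hypothetical, the compound names are placeholders, and it assumes the PUG View XML has already been parsed into dictionaries.

```python
import csv
import io

FIELDS = ["name", "therapeutic_uses", "toxicity"]  # hypothetical column names

def split_by_toxicity(records):
    """Drop compounds without therapeutic uses, then split the rest by toxicity."""
    toxic, non_toxic = [], []
    for rec in records:
        if not rec["therapeutic_uses"]:
            continue  # no therapeutic uses: eliminated from both lists
        if rec["toxicity"]:
            toxic.append(rec)
        else:
            non_toxic.append(rec)
    return toxic, non_toxic

def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Placeholder records standing in for parsed PubChem data
sample = [
    {"name": "compound A", "therapeutic_uses": "antiseptic", "toxicity": "skin irritant"},
    {"name": "compound B", "therapeutic_uses": "emollient", "toxicity": ""},
    {"name": "compound C", "therapeutic_uses": "", "toxicity": "skin irritant"},
]
toxic, non_toxic = split_by_toxicity(sample)
toxic_csv = to_csv(toxic)
non_toxic_csv = to_csv(non_toxic)
```

Here compound C is dropped because it has no therapeutic uses, and the two remaining compounds land in separate CSV texts.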

PubMed Database³
Built and ran a program to query PubMed “Entrez” E-Utilities for research abstracts and obtain their IDs
Query URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=&term="methanol""adverse effect""skin"
Downloaded each abstract found
Abstract URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=PubMed&retmode=xml&ID=17613130
3 PubMed, National Library of Medicine, National Center for Biotechnology Information, https://pubmed.ncbi.nlm.nih.gov/
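
A minimal sketch of how the two E-Utilities URLs might be built and how IDs might be read from a search result. Several things here are assumptions rather than the project's code: the helper names are hypothetical, `db=pubmed` is filled in for the search query, the parameters are URL-encoded with the standard library, and `sample_response` is a hand-written stand-in for an ESearch XML reply, so no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(compound_name):
    """Build a search URL following the quoted-phrase pattern shown on the slide."""
    term = f'"{compound_name}""adverse effect""skin"'
    # Assumption: the PubMed database is the intended search target
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": "pubmed", "term": term})

def parse_ids(esearch_xml):
    """Extract the abstract IDs from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]

def build_efetch_url(pubmed_id):
    """Build the URL that downloads one abstract as XML."""
    params = {"db": "pubmed", "retmode": "xml", "id": pubmed_id}
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(params)

# Hand-written stand-in for a real ESearch response
sample_response = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>17613130</Id></IdList></eSearchResult>"
)
search_url = build_esearch_url("methanol")
ids = parse_ids(sample_response)
fetch_urls = [build_efetch_url(i) for i in ids]
```

In a full run, each `search_url` would be requested once per compound and each entry in `fetch_urls` once per abstract found.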

Natural Language Processing
Natural Language Toolkit (NLTK)⁴ is a Python library for basic NLP functions:
Tokenization – Splitting natural language into smaller units for easier processing
◦ We split abstract text into word tokens and sentence tokens
Frequency Distributions – Basically, counting words in the text
◦ After removing stop words (“a”, “the”, “it”, etc.), we got the 5 most common words in each abstract
Collocations – Multiple words appearing together in the text, such as “technical college”
◦ Expanded the scope to check entire sentences for compound names and skin words appearing together
4 Natural Language Toolkit, https://www.nltk.org/

Tokenizing by Words and Sentences

    import nltk

    # Prepare a pre-trained model for detecting and tokenizing sentences
    nltk.download('punkt')
    sentence_detector = nltk.data.load("tokenizers/punkt/english.pickle")

    def tokenizeAbstract(abstractText):
        abstractTokens = []
        # Tokenize all words in lowercase
        words = nltk.word_tokenize(abstractText)
        words = [w.lower() for w in words]
        # Remove whitespace around the text and tokenize sentences
        sentences = sentence_detector.tokenize(abstractText.strip())
        # Return sentence and word lists
        abstractTokens.append((sentences, words))
        return abstractTokens

Top 5 Words Frequency Distribution

    import nltk
    from nltk.probability import FreqDist

    # Prepare English stop words with punctuation added
    nltk.download('stopwords')
    stopwords = nltk.corpus.stopwords.words("english")
    mystopwords = [".", ",", "(", ")", "%", ";", ":"]
    stopwords.extend(mystopwords)

    def getTopFiveWords(words):
        # Filter out stop words and create the Frequency Distribution
        filtered_words = []
        for w in words:
            if w not in stopwords:
                filtered_words.append(w)
        freqDist = FreqDist(filtered_words)
        # Return list of top 5 words
        return freqDist.most_common(5)
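
To see the counting step without downloading any NLTK data, the sketch below uses `collections.Counter` from the standard library as a stand-in for `FreqDist`. The stop word set, the helper name `top_words`, and the sample sentence are all made up for illustration and are far smaller than NLTK's English stop word list.

```python
from collections import Counter

# Tiny illustrative stop word set (not NLTK's list)
STOP_WORDS = {"a", "the", "it", "of", "to", "and", "on", ".", ","}

def top_words(words, n=5):
    """Return the n most frequent words after removing stop words."""
    filtered = [w for w in words if w not in STOP_WORDS]
    return Counter(filtered).most_common(n)

# Made-up sentence, already lowercased and split into word tokens
words = "the compound irritated the skin and the skin reddened .".split()
result = top_words(words, n=2)
most_frequent_word, count = result[0]
```

After filtering, "skin" is the only word that appears twice, so it is the first entry in `result`.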

Calculating Sentence Collocation Ratios

    def getSkinAndCompoundNameCollocations(compound_name, sentences):
        # Avoid dividing by zero when an abstract has no sentences
        if not sentences:
            return 0.0
        collocations = 0
        # Skin-related words to search for in each sentence
        skin_words = ["skin", "dermal", "epidermis"]
        for sentence in sentences:
            sentence = sentence.lower()
            # Search each sentence for the skin-related words
            skin_found = False
            for w in skin_words:
                if sentence.find(w) > -1:
                    skin_found = True
                    break
            # Count the sentence if it also has compound name
            if skin_found and sentence.find(compound_name.lower()) > -1:
                collocations += 1
        # Return a ratio of collocation sentences to total sentences
        return collocations / len(sentences)
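
A small worked example may help make the ratio concrete. The sketch below restates the same substring check in compact form under a hypothetical name, `collocation_ratio`, and runs it on four invented sentences; the sentences are not taken from any real abstract.

```python
SKIN_WORDS = ("skin", "dermal", "epidermis")

def collocation_ratio(compound_name, sentences):
    """Fraction of sentences mentioning both the compound and a skin word."""
    if not sentences:
        return 0.0
    name = compound_name.lower()
    hits = 0
    for sentence in sentences:
        text = sentence.lower()
        if name in text and any(word in text for word in SKIN_WORDS):
            hits += 1
    return hits / len(sentences)

# Invented sentences for illustration only
sentences = [
    "Methanol was applied to the skin of each subject.",     # compound + skin word
    "Dermal absorption was measured after one hour.",        # skin word only
    "Methanol levels in blood were recorded.",               # compound only
    "No epidermis damage was linked to methanol exposure.",  # compound + skin word
]
ratio = collocation_ratio("methanol", sentences)
```

Two of the four sentences contain both the compound name and a skin-related word, so `ratio` is 0.5. Because the check is a plain substring match, a word such as "skinny" would also count as a skin word.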

Analyzing the Data
CSV results were analyzed in Microsoft Excel:
◦ Found duplicate abstracts that appeared for more than one compound
◦ Filtered out abstracts with no compound name, skin-related words, or collocations
◦ Chose abstracts to manually review based on the relevance of top 5 words
                            Toxic Compounds   Non-Toxic Compounds   Total
Total abstracts             287               261                   548
Unique abstracts            221               80                    301
Abstracts w/ collocations   49                2                     51
Relevant abstracts          16                0                     16
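
The analysis was done in Microsoft Excel, but the same deduplicate, filter, and sort steps could be sketched in Python as below. This is an assumed equivalent, not the author's workflow: the field names (`pmid`, `compound`, `collocation_ratio`), the helper name `select_for_review`, and the sample rows are hypothetical, and a duplicate abstract is simply kept at its first occurrence.

```python
def select_for_review(rows):
    """Drop duplicate abstracts and zero-collocation rows, then rank by ratio."""
    seen = set()
    kept = []
    for row in rows:
        if row["pmid"] in seen:
            continue  # abstract already listed under another compound
        seen.add(row["pmid"])
        if row["collocation_ratio"] > 0:
            kept.append(row)
    # Highest collocation ratio first
    return sorted(kept, key=lambda r: r["collocation_ratio"], reverse=True)

# Placeholder rows standing in for the CSV results
rows = [
    {"pmid": "101", "compound": "compound A", "collocation_ratio": 0.25},
    {"pmid": "102", "compound": "compound A", "collocation_ratio": 0.0},
    {"pmid": "101", "compound": "compound B", "collocation_ratio": 0.25},
    {"pmid": "103", "compound": "compound B", "collocation_ratio": 0.5},
]
ranked = select_for_review(rows)
ranked_ids = [row["pmid"] for row in ranked]
```

The ranked list is then the reading queue: the top 5 words of each remaining abstract are examined by hand to decide whether it is worth reading in full.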

Conclusion
☆ Our method searched for 28,004 cosmetic ingredients, analyzed 301 unique abstracts, and found 16 relevant studies about adverse effects on human skin.
Zero relevant results came from non-toxic compounds.
Querying the PubChem and PubMed databases took 15-30 minutes due to network restrictions.
NLP on all abstract texts completed in less than a second.
Using NLP can greatly reduce the overhead of literature reviews for complicated subjects: here it narrowed 301 unique abstracts to 51 with collocations before manual review.

Thank You!
Source code available on GitHub:
https://github.com/rsonger/CosIng-Toxicity
