Public repositories of diverse chemical and biological data are one of the main sources of knowledge for biomedical research. Unfortunately, extracting these data and transforming them into a readily interpretable form is a complex task. Ongoing community efforts focus mainly on the analysis of term co-occurrences, text annotation based on term similarity, and related tasks [1].
Here we present an approach based on natural-language processing techniques that is intended to shift the search for similar texts on chemical topics from the word level to the document level. PubMed records were used to train word2vec and doc2vec models. The generated text representations can be used to search for similar abstracts; the resulting similarity depends more on these representations than on the co-occurrence of particular terms (neighbor compounds, similar publication date, etc.).
Document-level clustering was also implemented to provide insight into the PubMed text corpus structure. This approach can serve as an alternative to standard topic modeling techniques for the discovery of hidden semantic features in an unsupervised manner.
3. Object: PubMed
• 30 million citations
• Search: context, author, article…
• … clinical queries, similar articles
4. PubMed similarity search
• “…The similarity between documents is measured by the words they
have in common, with some adjustment for document lengths. To
carry out such a program, one must first define what a word is. For us,
a word is basically an unbroken string of letters and numerals with at
least one letter of the alphabet in it…”
• “…Each time a word is used, it is assigned a numerical weight. This
numerical weight is based on information that the computer can
obtain by automatic processing…”
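The word definition and weighting quoted above can be illustrated with a minimal sketch. It assumes ASCII-only words, a smoothed inverse-document-frequency weight, and cosine-style length normalisation; these are illustrative choices, not PubMed's actual weighting scheme, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# A "word": an unbroken run of letters and numerals (ASCII only, an assumption).
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return lowercase words that contain at least one letter."""
    return [t.lower() for t in WORD_RE.findall(text)
            if any(c.isalpha() for c in t)]

def idf_weights(docs):
    """Hypothetical weighting: smoothed inverse document frequency."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(tokenize(d)))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}

def similarity(a, b, weights):
    """Weighted overlap of shared words, normalised for document length."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))

    def norm(counts):
        return math.sqrt(sum((weights.get(w, 1.0) * k) ** 2
                             for w, k in counts.items()))

    shared = sum(weights.get(w, 1.0) ** 2 * ca[w] * cb[w]
                 for w in ca.keys() & cb.keys())
    denom = norm(ca) * norm(cb)
    return shared / denom if denom else 0.0

docs = [
    "mitral stenosis in 41 patients",
    "mitral valve calcification",
    "restless legs syndrome",
]
weights = idf_weights(docs)
print(similarity(docs[0], docs[1], weights))  # share the word "mitral"
print(similarity(docs[0], docs[2], weights))  # no words in common
```

Note that the purely numeric token "41" is dropped, as the quoted definition requires at least one letter.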
5. PubMed similarity search – alternative ways
• Kim S, Fiorini N, Wilbur WJ, Lu Z. Bridging the gap: Incorporating a
semantic similarity measure for effectively mapping PubMed queries
to documents. J Biomed Inform. 2017;75:122–127.
doi:10.1016/j.jbi.2017.09.014
• Peng S, You R, Wang H, Zhai C, Mamitsuka H, Zhu S. DeepMeSH: deep
semantic representation for improving large-scale MeSH
indexing. Bioinformatics. 2016;32(12):i70–i79.
doi:10.1093/bioinformatics/btw294
• …
6. NLP and word embeddings
Figure source: www.towardsdatascience.com
9. Why should we use specific embedding?
Mitrofanov et al., Neural network tool for dialog semantic analysis and quality control management. In publishing…
11. Word vectors as a universal tool: general
• Vectors can be processed using simple math:
• Context search
• Similarity search
• ‘Semantic math’
12. Word vectors as a universal tool: similarity search
Article: PubMed 6125428. Maternal and public health benefits
of menstrual regulation in Chittagong.
Abstract: In Bangladesh abortions induced by untrained
traditional birth attendants cause high levels of morbidity and
mortality. A 2-year program was initiated to provide early
atraumatic termination of pregnancy at Chittagong Medical
College Hospital on an outpatient basis. The effect of the
program on hospital admissions for septic induced abortion
was studied by reviewing hospital records. In 2 years hospital
admission for induced abortion decreased by 72%, and bed
days used for treatment of induced abortion declined by 75%.
Deaths occurring in the hospital that were attributed to
induced abortion remained at a low level but were not
eliminated.
PubMed Id    Cosine similarity
15814392     0.8059433
12338521     0.79276556
15750566     0.7777994
...          ...
29335199     -0.5755786
29256494     -0.60501575
29311511     -0.6054055

PubMed Id    Cosine similarity
12345324     0.60527635
6119248      0.713412
12338521     0.79276556
PubMed results Our results
13. Word vectors as a universal tool: similarity search
Article: PubMed 6146009. Acquired immunodeficiency
syndrome in a heterosexual population in Zaire.
Abstract: 38 patients with the acquired immunodeficiency
syndrome (AIDS) were identified in Kinshasa, Zaire, during a 3
week period in 1983. The male to female ratio was 1.1:1. The
annual case rate for Kinshasa was estimated to be at least 17
per 100 000. [...].
PubMed Id    Cosine similarity
6146008      0.928879
15964962     0.8923047
16318255     0.8743412
...          ...
30503650     -0.6103743
29608852     -0.61208564
30004552     -0.62419856

PubMed Id    Cosine similarity
6146008      0.928879
8236386      0.75797963
2867324      0.7872803
PubMed results Our results
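Ranked lists like those on the two slides above can be produced by scoring every document vector against the query vector and sorting. The sketch below assumes the document vectors are already available as plain lists (for example, from a trained doc2vec model); the ids and 3-dimensional vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_similarity(query_id, doc_vectors):
    """Return (doc_id, cosine) pairs, most similar first, excluding the query."""
    q = doc_vectors[query_id]
    scored = [(d, cosine(q, v)) for d, v in doc_vectors.items() if d != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors with hypothetical ids; real vectors would have hundreds of dimensions.
doc_vectors = {
    "doc_a": [1.0, 0.2, 0.0],
    "doc_b": [0.9, 0.3, 0.1],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [-1.0, -0.1, 0.0],
}

ranking = rank_by_similarity("doc_a", doc_vectors)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The head of the ranking gives the most similar documents and the tail gives the least similar ones, which is how the negative values at the bottom of the tables arise.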
14. Cosine similarity
First abstract: Tuberculosis in Abidjan: comparison of children and adults. 25323777
Second abstract: Impact of HIV infection on the development, clinical presentation, and outcome of tuberculosis among children in Abidjan, Côte d'Ivoire. 9233463
Cosine similarity: 0.89011306

First abstract: Tuberculosis in Abidjan: comparison of children and adults. 25323777
Second abstract: Advancing synthetic therapies for the treatment of restless legs syndrome. 31424287
Cosine similarity: 0.3947778

First abstract: Tuberculosis in Abidjan: comparison of children and adults. 25323777
Second abstract: Crystal structure of bacterial CYP116B5 heme domain: New insights on class VII P450s structural flexibility and peroxygenase activity. 31430491
Cosine similarity: -0.36352736
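For reference, the standard definition of cosine similarity for two document vectors is given below. The symbols u, v (the vectors) and n (their dimension) are notation introduced here, not taken from the slides.

```latex
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}
         {\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}},
  \qquad -1 \le \cos(\mathbf{u}, \mathbf{v}) \le 1
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions, which matches the progression of the three pairs above.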
15. Word vectors as a universal tool: context search
Target article - PubMed 7320548. Abnormal left ventricular configuration and
contraction in patients with mitral stenosis: a cross-sectional
echocardiographic study.
Target abstract - To assess left ventricular shape and contraction pattern
under the condition of the narrowed mitral orifice, 41 patients with mitral
stenosis were studied by cross-sectional and M-mode echocardiography. […].
This study showed that abnormal left ventricular shape and asynergy in the
posterior wall are not rare in patients with mitral stenosis and are due to a
rigid mitral complex and associated tricuspid regurgitation.
Target word - mitral, window width = 2
Best article - PubMed 7449460. Cross-sectional echocardiographic observations
on the mechanism of preservation of the opening snap in calcific mitral stenosis.
Best abstract - Cross-sectional echocardiography (CSE) was performed on 19
patients with mitral stenosis (MS). Nine patients (group 1) had little or no mitral
valve (MV) calcification; ten patients (group 2) had heavy MV calcification. An
opening snap was present in each of the patients in group 1. Of the group 2
patients, five (group 2A) had an opening snap and five (group 2B) did not [...].
Cosine similarity = 0.810
PubMed Id          7449460  7566936  7363426  ...  7534881  7586506  7575329
Cosine similarity  0.810    0.806    0.798    ...  -0.098   -0.099   -0.148
16. Word vectors as a universal tool: context search
Target article and abstract - PubMed 7320548, the same as on the previous slide.
Target word - mitral, window width = 10
Best article - PubMed 7566936. Rheumatic fever, a disease undergoing change,
based on the experience of the past 15 years.
Best abstract - [...]. Two children had chorea minor with carditis and there were 2
additional cases with chorea minor only. Valvular heart disease has developed in
21 patients. There were 16 patients with mitral regurgitation, these conditions
have occurred in combination with mitral stenosis in 3 cases, whereas the
insufficientia of the aortic valve developed in two patients. [...].
Cosine similarity = 0.860
PubMed Id          7566936  7377092  7371133  ...  7552248  7534881  7575329
Cosine similarity  0.860    0.842    0.830    ...  0.013    -0.008   -0.033
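The slides do not spell out how the context search is computed. One plausible reading, sketched below as an assumption, is that the word vectors within a fixed window around each occurrence of the target word are averaged, and the resulting context vectors are compared by cosine similarity. The function name `context_vector` and the 2-dimensional toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def context_vector(tokens, target, window, word_vectors):
    """Average the vectors of in-vocabulary words within `window`
    positions of every occurrence of `target`; None if nothing is found."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in word_vectors:
                total = [t + x for t, x in zip(total, word_vectors[tokens[j]])]
                count += 1
    return [t / count for t in total] if count else None

# Toy vocabulary; words missing from it (e.g. "with") are skipped.
word_vectors = {
    "patients": [1.0, 0.0],
    "stenosis": [0.0, 1.0],
    "valve": [0.0, 0.8],
    "studied": [1.0, 1.0],
}
tokens_a = ["patients", "with", "mitral", "stenosis", "were", "studied"]
tokens_b = ["calcific", "mitral", "valve"]

narrow = context_vector(tokens_a, "mitral", 1, word_vectors)
wide = context_vector(tokens_a, "mitral", 2, word_vectors)
other = context_vector(tokens_b, "mitral", 1, word_vectors)
print(narrow, wide, cosine(narrow, other), cosine(wide, other))
```

Under this reading, a wider window pulls in more of the surrounding abstract, which would be consistent with the best-matching article changing between window widths 2 and 10.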
18. Word vectors as a universal tool: semantic math
Article: PubMed 7320349. The study of reciprocal influences
through experimental modification of social interaction between
functional adult-child pairs.
Abstract: Studies using the Functional Pairs Approach to the study
of socialization processes are reviewed, and its strengths and
weaknesses are discussed. By staging social encounters between
children and biologically unrelated adults, this approach can
achieve excellent isolation of causal effects involving a wide range
of behaviors. [...].
Article: PubMed 7320347. The use of psychopharmacology to
study reciprocal influences in parent-child interaction.
Abstract: The present paper reviews the possible utility and
limitations of using behavior-modifying drugs to study reciprocal
influences in parent-child interactions. Ideal circumstances for use
of this approach are outlined and contrasted with the current
status of the field of psychopharmacology. [...].
Only 7320349  Only 7320347  Mean      Difference
11081695      12422555      7320349   9239599
8306142       9246227       7320347   11274897
8605823       10674189      9635219   8282203
...           ...           ...       ...
8769308       7780815       9685863   10984914
10670578      7539088       10670578  7592196
8536693       10342279      8536693   7397482
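How the four columns were computed is not stated on the slide. A minimal sketch, assuming that "Mean" and "Difference" refer to the element-wise mean and difference of the two document vectors, each then used as a query for a nearest-neighbour search, could look like this. All ids and vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(u, v):
    """Element-wise mean of two vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def difference_vector(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def nearest(query, doc_vectors, k=3):
    """Return the k (doc_id, cosine) pairs closest to the query vector."""
    scored = [(d, cosine(query, v)) for d, v in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Two hypothetical article vectors and a tiny corpus of toy vectors.
article_1 = [1.0, 0.0]
article_2 = [0.0, 1.0]
corpus = {"x": [1.0, 1.0], "y": [1.0, -1.0], "z": [-1.0, 1.0]}

mean_hits = nearest(mean_vector(article_1, article_2), corpus)
diff_hits = nearest(difference_vector(article_1, article_2), corpus)
print(mean_hits)
print(diff_hits)
```

The mean vector retrieves documents close to both articles at once, while the difference vector favours documents that resemble the first article but not the second.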
19. Why do we need all the math?
• Clustering
• Extracting new concepts (synthesis routes?)
• drug-target-disease
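For the clustering item in the list above, the slides do not name the algorithm used. The sketch below uses plain k-means as a stand-in, with the simplifying assumption that the first k points serve as initial centroids; the 2-dimensional points are toy stand-ins for document vectors.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points are the initial centroids
    (a simplification, real use would pick a better initialisation)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy "document vectors" forming two well-separated groups.
points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because k-means uses Euclidean distance, document vectors would typically be length-normalised first if the grouping is meant to follow cosine similarity.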