DDH 2021-03-03: Text Processing and Searching in the Medical Domain
1. Text Processing and Searching in the Medical Domain
Gianmaria Silvello
Department of Information Engineering, University of Padua
gianmaria.silvello@unipd.it
http://www.dei.unipd.it/~silvello/
@giansilv
Intelligent Interactive Information Access Hub
4. Text processing pipeline: Tokenization
Tokenization (T) is a way of separating a piece of text into smaller units called tokens.

Word tokenization
Pros: very common; allows the use of pre-trained embeddings
Cons: Out-Of-Vocabulary (OOV) terms; large vocabulary

Character tokenization
Pros: no OOV; limited vocabulary size
Cons: hard to relate single characters to word meaning

N-gram tokenization
Pros: no OOV; limited vocabulary size
Cons: computationally more expensive than word tokenization
Tokenization example. Input text:
"polyp distal sigma: tubulovilloso adenoma with severe dysplasia/carcinoma in intramucosal minor fragment. in increased fragment aspects are severe glandular epithelial dysplasia."

Word tokens (counts): polyp: 1, distal: 1, sigma: 1, tubulovilloso: 1, adenoma: 1, with: 1, severe: 2, dysplasia: 1, carcinoma: 1, in: 2, intramucosal: 1, minor: 1, fragment: 2, increased: 1, aspects: 1, …
Character tokens (counts): p: 6, o: 8, l: 12, y: 3, d: 6, i: 14, …
Character 3-grams (counts): pol: 2, oly: 2, lyp: 2, yp_: 2, …
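A minimal sketch of the three tokenization strategies in Python. The splitting rules are illustrative; real tokenizers handle punctuation, casing and word boundaries more carefully:

```python
from collections import Counter
import re

text = ("polyp distal sigma: tubulovilloso adenoma with severe "
        "dysplasia/carcinoma in intramucosal minor fragment. in increased "
        "fragment aspects are severe glandular epithelial dysplasia.")

# Word tokenization: split on non-alphabetic characters (illustrative rule)
words = re.findall(r"[a-z]+", text.lower())
word_counts = Counter(words)

# Character tokenization: every alphabetic character
chars = [c for c in text.lower() if c.isalpha()]
char_counts = Counter(chars)

# Character n-gram tokenization (n = 3); "_" marks the end of a word
def char_ngrams(tokens, n=3):
    for tok in tokens:
        padded = tok + "_"
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

ngram_counts = Counter(char_ngrams(words, n=3))

print(word_counts.most_common(5))
print(char_counts.most_common(5))
print(ngram_counts.most_common(5))
```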
5. Text processing pipeline: Tokenization → Stopword removal
Stopword removal (SR) is a key step in text processing.
The distribution of words is not statistically uniform: Luhn's observation / Zipf's law.

Stopword removal example. Tokenized input text:
"polyp distal sigma tubulovilloso adenoma with severe dysplasia carcinoma in intramucosal minor fragment in increased fragment aspects are severe glandular epithelial dysplasia"
6. Text processing pipeline: Tokenization → Stopword removal → Stemming
Stemming (S) reduces words to their root.
Rule-based stemmers for languages with rich linguistic resources (e.g., English): Porter, Lovins, Paice.
Statistical stemmers for languages with scarce linguistic resources (e.g., Hindi): FBC, GRASS, SNS, YASS.
N-grams are an alternative to stemming.

Stemming example. Input text (after stopword removal):
"polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia"
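A minimal sketch of rule-based stemming using NLTK's Porter stemmer (a real NLTK class); the token list is the example above:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
tokens = ("polyp distal sigma tubulovilloso adenoma severe dysplasia "
          "carcinoma intramucosal minor fragment increased fragment "
          "aspects severe glandular epithelial dysplasia").split()

stems = [stemmer.stem(t) for t in tokens]
print(stems)  # e.g., "aspects" -> "aspect", "increased" -> "increas"
```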
7. Text processing pipeline: Tokenization → Stopword removal → Stemming → Part-of-Speech tagging
Words can be grouped into classes referred to as Parts of Speech (PoS) or morphological classes.
A word's PoS provides crucial information to determine the role of the word itself and of the words close to it in the sentence.
The four largest open classes of words, present in most languages, are:
- nouns
- verbs
- adverbs
- adjectives
Approaches: rule-based POS taggers, probabilistic approaches (HMM).
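A minimal sketch of PoS tagging with NLTK's off-the-shelf tagger. The sentence is illustrative, and the exact names of the downloadable tagger models may differ slightly across NLTK versions:

```python
import nltk  # pip install nltk

# One-time download of the default English tagger model.
nltk.download("averaged_perceptron_tagger")

tokens = "Colon biopsy spots scar of previous polypectomy".split()
print(nltk.pos_tag(tokens))
# A list of (token, tag) pairs, e.g. ('biopsy', 'NN') for a singular noun.
```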
8. Text processing pipeline: Tokenization → Stopword removal → Stemming → Part-of-Speech tagging → Named Entity Recognition
Named entity recognition (NER), also called entity identification or entity extraction, is an information extraction technique that automatically identifies named entities in a text and classifies them into predefined categories.
Approaches:
- Lexicon approach
- Rule-based systems
- Machine learning-based systems
- Hybrid approach

NER example. Input text (after stopword removal):
"polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia"
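A minimal sketch of the lexicon approach: match tokens against a small dictionary of concepts. The lexicon and category names below are made up for illustration, not a real medical terminology:

```python
# Illustrative lexicon mapping surface forms to entity categories.
LEXICON = {
    "adenoma": "DIAGNOSIS",
    "dysplasia": "DIAGNOSIS",
    "carcinoma": "DIAGNOSIS",
    "polyp": "FINDING",
    "sigma": "ANATOMY",
}

tokens = ("polyp distal sigma tubulovilloso adenoma severe dysplasia "
          "carcinoma intramucosal minor fragment increased fragment "
          "aspects severe glandular epithelial dysplasia").split()

entities = [(t, LEXICON[t]) for t in tokens if t in LEXICON]
print(entities)
# [('polyp', 'FINDING'), ('sigma', 'ANATOMY'), ('adenoma', 'DIAGNOSIS'), ...]
```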
9. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
10. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
First step: Tokenization (T)
11. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
Tokenized text:
<Colon, biopsy, spots, scar, of, previous, polypectomy, and, colonic, mucosa, fragment, with, fibrosis, no, evidence, inflammation, dysplasia, or, malignancy>
Next step: Stopword removal (SR)
Stopword removal might be problematic: here we would lose the negation ("no evidence"). A similar problem might occur with stemming.
12. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
Next step: Part-of-Speech tagging (POS)
15. Term representation
Vector representations are central in many applications (e.g., IR and machine learning).
Usually, we focus on terms as the smallest unit of representation, but we may also consider n-grams.
Different representations lead to different notions of similarity, and to different properties of "compositionality" used to build passage or document representations.
16. Local representation of terms
Local (one-hot vector) representation:
- There is a fixed vocabulary V
- The size of the vectors is |V|
- 1 means the term is present, 0 means the term is absent

Sentence: "banana is a fruit"
Term indices: banana = 0, is = 1, a = 2, fruit = 3
banana: [1 0 0 0]
is:     [0 1 0 0]
a:      [0 0 1 0]
fruit:  [0 0 0 1]
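A minimal sketch of one-hot term vectors with NumPy, building the vocabulary from the sentence itself:

```python
import numpy as np

sentence = "banana is a fruit".split()
vocab = {term: idx for idx, term in enumerate(sentence)}  # fixed vocabulary V

def one_hot(term, vocab):
    """Return a |V|-dimensional vector with a 1 at the term's index."""
    v = np.zeros(len(vocab), dtype=int)
    v[vocab[term]] = 1
    return v

for term in sentence:
    print(term, one_hot(term, vocab))
```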
18. Local representation
Sentence: "banana is a fruit as well as mango; dog is an animal"
Term indices 0-4 are assigned to the content words: banana, fruit, mango, dog, animal
20. Local representation
Sentence: "banana is a fruit as well as mango; dog is an animal"
Term indices: banana = 0, fruit = 1, mango = 2, dog = 3, animal = 4
banana: [1 0 0 0 0]
fruit:  [0 1 0 0 0]
mango:  [0 0 1 0 0]
dog:    [0 0 0 1 0]
animal: [0 0 0 0 1]

We cannot use this representation to define similarity between terms:
- High dimensionality
- Sparsity
- With TF-IDF weighting the problem does not change
- We are ignoring context
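A short check of why one-hot vectors cannot express term similarity: the cosine similarity between any two distinct one-hot vectors is always 0 (a sketch with NumPy):

```python
import numpy as np

terms = ["banana", "fruit", "mango", "dog", "animal"]
vectors = np.eye(len(terms))  # one one-hot row per term

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# banana vs. mango and banana vs. dog are equally (dis)similar: both 0.0
print(cosine(vectors[0], vectors[2]))  # banana vs. mango -> 0.0
print(cosine(vectors[0], vectors[3]))  # banana vs. dog   -> 0.0
```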
21. Distributed representation
Under distributed representations, every term is represented by a vector: either a vector of hand-crafted features or a learnt representation in which the individual dimensions are not interpretable in isolation.
They use an implicit notion of similarity between words: "banana" is more similar to "mango" than to "dog" because they are both fruits, yet different because of other properties that are not shared between the two, such as shape.
22. Distributed representation
Distributional Hypothesis: terms that are used (or occur) in similar contexts tend to be semantically similar [1].
Distributional Semantics: a word is characterised by the company it keeps [2].
[1] Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
[2] John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. (1957).
23. Distributed representation (with context)
Feature vectors for "banana" under different notions of context:

In-document representation [1]:
banana -> [1 0 1 0 0 0 0 1 1 0 0 0 0 … 0] over documents [d0 - d2 - - - - d7 d8 - - - - … -]

Neighbouring-word features [2]:
banana -> [1 1 0 0 0 0 1 0 0 0 0 1 … 0] over context words [flies fruit - - - - a - - - - like …]

Neighbouring-word features with distance:
banana -> context words [flies fruit - - - - a - - - - like …] at positions [-3 -2 - - - - +1 - - - - +3 …]

Character n-grams (e.g., 3-grams):
banana -> [0 1 1 0 0 0 1 0 1 0 0 1 …] over n-grams [- #ba na# - - - ana - nan - - ban …]

[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS 41, 6 (1990), 391–407.
[2] Geoffrey E. Hinton. 1984. Distributed representations. (1984).
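A minimal sketch of a neighbouring-word representation: count, for each term, which words appear within a small window around it. The toy corpus and window size are illustrative:

```python
from collections import Counter, defaultdict

corpus = [
    "fruit flies like a banana".split(),
    "a banana is a fruit".split(),
]
WINDOW = 2  # words to the left and right counted as context

cooc = defaultdict(Counter)
for sentence in corpus:
    for i, term in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[term][sentence[j]] += 1

print(cooc["banana"])  # context words observed near "banana" within the window
```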
24. Compositionality
A document (or a sentence) can be represented as an aggregation of distributed representations, e.g.:
- sum of the word vectors
- average of the word vectors
- …
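A minimal sketch of compositionality by summing or averaging, with toy 3-dimensional word vectors (the values are made up purely for illustration):

```python
import numpy as np

# Toy word embeddings (values are illustrative only).
embeddings = {
    "severe":    np.array([0.9, 0.1, 0.0]),
    "glandular": np.array([0.2, 0.8, 0.1]),
    "dysplasia": np.array([0.7, 0.3, 0.2]),
}

tokens = ["severe", "glandular", "dysplasia"]
vectors = np.stack([embeddings[t] for t in tokens])

doc_sum = vectors.sum(axis=0)   # sum of the word vectors
doc_avg = vectors.mean(axis=0)  # average of the word vectors
print(doc_sum, doc_avg)
```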
25. Similarity-based representations
(thanks to https://www.shanelynn.ie)
We build models of meaning focusing on similarity.
Each word is a vector; similar words are "nearby in space".
We define a word as a vector called an "embedding" because it is embedded into a space.
This is the standard way to represent meaning in NLP.
26. How do we define similarity?
Bhaskar Mitra, Nick Craswell: Neural Models for Information Retrieval. CoRR abs/1705.01509 (2017)
27. How do we define similarity?
Typicality (paradigmatic): neighbouring-word features with distance.
Topicality (syntagmatic): in-document representation.
29. Term embeddings
Also called: (word) embeddings, distributional semantic model, semantic vector space.
Slides inspired by Dan Jurafsky and James Martin, Speech and Language Processing (Stanford).
https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
30. Embeddings
An embedding is a representation of items in a new space such that the properties of the items are respected.
For terms we can have sparse/explicit or dense/implicit representations. Dense/implicit representations based on feature-predicting models are:
- easier to use as features in machine learning (fewer weights to tune)
- often better-performing than embeddings based on explicit counting (dense vectors may generalize better)
- better at capturing synonymy
- easy to visualise
- but hard to interpret, if it is even possible
34. Word2Vec
Word2Vec is a popular embedding method and is very fast to train.
Idea: predict rather than count. Instead of counting how often each word w2 occurs near w1, train a classifier on a binary prediction task: is w2 likely to show up near w1?
We don't actually care about this task itself, but we take the learned classifier weights as the word embeddings.
Word2Vec (W2V) comes in two versions, implementing two different learning architectures:
1) Skip-Gram
2) Continuous Bag-Of-Words (CBOW)
36. Word2Vec: Skip-Gram
Skip-gram algorithm:
1. Treat the target word and a neighbouring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the weights as the embeddings.

In statistics, the logistic model (or logit model) is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
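A minimal sketch of training skip-gram embeddings with the gensim library. The toy corpus is illustrative; `sg=1` selects skip-gram and `negative=5` enables negative sampling (both real gensim parameters):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus: each document is a list of tokens (illustrative).
corpus = [
    "polyp distal sigma adenoma severe dysplasia".split(),
    "colon biopsy scar previous polypectomy colonic mucosa fibrosis".split(),
    "no evidence inflammation dysplasia malignancy".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimensionality
    window=2,         # context window size
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive example
    epochs=50,
)

print(model.wv["dysplasia"][:5])           # first 5 dimensions of the vector
print(model.wv.most_similar("dysplasia"))  # nearest neighbours in the space
```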
38. fastText (Facebook)
FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.
FastText assumes a word to be formed by character n-grams: for example, "sunny" is composed of [sun, sunn, sunny], [sunny, unny, nny], etc., where n could range from 1 to the length of the word.
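A minimal sketch of the character n-gram decomposition described above. The boundary handling is simplified; the real FastText implementation adds `<` and `>` word-boundary markers and hashes the n-grams:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of the word with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, min(max_n, len(word)) + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

print(char_ngrams("sunny"))
# ['sun', 'unn', 'nny', 'sunn', 'unny', 'sunny']
```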
39. fastText
This word representation used by fastText provides the following benefits over word2vec or GloVe:
- Rare words: it allows us to find a vector representation for rare words. Since rare words can still be broken into character n-grams, they share these n-grams with common words.
- OOV: it can give vector representations for words not present in the dictionary (OOV words), since these can also be broken down into character n-grams.
- Character n-gram embeddings tend to perform better than word2vec and GloVe on smaller datasets.
https://www.analyticsvidhya.com/blog/2017/07/word-representations-text-classification-using-fasttext-nlp-facebook/
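A minimal sketch of the OOV benefit using gensim's FastText implementation on a toy corpus. Because vectors are composed from character n-grams, a word absent from training (here the hypothetical plural "dysplasias") still receives a vector:

```python
from gensim.models import FastText  # pip install gensim

corpus = [
    "severe glandular epithelial dysplasia".split(),
    "adenoma with severe dysplasia".split(),
]

model = FastText(
    sentences=corpus,
    vector_size=50,
    window=2,
    min_count=1,
    min_n=3,   # smallest character n-gram
    max_n=6,   # largest character n-gram
    epochs=50,
)

# "dysplasias" never appears in the corpus, but its character n-grams do,
# so FastText can still compose a vector for it.
print(model.wv["dysplasias"][:5])
print(model.wv.similarity("dysplasia", "dysplasias"))
```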
41. Caveats: Training and Compositionality
Training dense vectors can be very expensive (time and resources).
A simple solution: use pre-trained vectors.
- fastText: https://fasttext.cc/docs/en/unsupervised-tutorial.html
- Word2Vec: https://code.google.com/archive/p/word2vec/
Once we have the word embeddings, we can build a document (paragraph, sentence) embedding by putting the word embeddings together:
- average the vectors
- sum the vectors
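A minimal sketch of this pre-trained route with gensim: load pre-trained vectors in word2vec text format and average them into a document vector. The file name is a placeholder for whichever pre-trained model is downloaded from the links above:

```python
import numpy as np
from gensim.models import KeyedVectors  # pip install gensim

# Placeholder path: point this at a downloaded pre-trained model
# (e.g., a fastText .vec file or word2vec vectors in text format).
vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.vec", binary=False)

def document_embedding(tokens, kv):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [kv[t] for t in tokens if t in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

doc = "colon biopsy with fibrosis no evidence of malignancy".split()
print(document_embedding(doc, vectors)[:5])
```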
42. Medical Word Embeddings
Word embeddings often ignore the internal structure of words and external knowledge.
Knowledge bases and ontologies can be really important in specialised domains (like the biomedical one).
There are specific word embeddings for the biomedical domain, e.g. BioWordVec (Scientific Data, 2019) or BioBERT (Lee et al., 2019).
43. Medical Word Embeddings
A crucial success factor is the data on which the embeddings are trained.
There is no one-size-fits-all solution, and general neural network-based solutions generally do not outperform traditional solutions.
Nevertheless, there is wide availability of biomedical training data, and neural networks are groundbreaking for text processing (and related tasks).