DDH 2021-03-03: Text Processing and Searching in the Medical Domain
1. Text Processing and Searching in the Medical Domain
Gianmaria Silvello
Department of Information Engineering, University of Padua
gianmaria.silvello@unipd.it
http://www.dei.unipd.it/~silvello/
@giansilv
Intelligent Interactive Information Access Hub
4. Text processing pipeline: Tokenization
Tokenization (T) is a way of separating a piece of text into smaller units called tokens.

Word tokenization
Pros: very common; allows the use of pre-trained embeddings
Cons: Out-Of-Vocabulary (OOV) terms; large vocabulary

Character tokenization
Pros: no OOV; limited vocabulary size
Cons: hard to relate single characters to word meaning

N-gram tokenization
Pros: no OOV; limited vocabulary size
Cons: computationally more expensive than word tokenization
Tokenization example. Input text:
"polyp distal sigma: tubulovilloso adenoma with severe dysplasia/carcinoma in intramucosal minor fragment. in increased fragment aspects are severe glandular epithelial dysplasia."

Word tokens (counts): polyp: 1, distal: 1, sigma: 1, tubulovilloso: 1, adenoma: 1, with: 1, severe: 2, dysplasia: 1, carcinoma: 1, in: 2, intramucosal: 1, minor: 1, fragment: 2, increased: 1, aspects: 1, …
Character tokens (counts): p: 6, o: 8, l: 12, y: 3, d: 6, i: 14, …
Character 3-grams (counts): pol: 2, oly: 2, lyp: 2, yp_: 2, …
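A minimal sketch of the three tokenization strategies in Python. The splitting rules are illustrative; real tokenizers handle punctuation, casing and word boundaries more carefully:

```python
from collections import Counter
import re

text = ("polyp distal sigma: tubulovilloso adenoma with severe "
        "dysplasia/carcinoma in intramucosal minor fragment. in increased "
        "fragment aspects are severe glandular epithelial dysplasia.")

# Word tokenization: split on non-alphabetic characters (illustrative rule)
words = re.findall(r"[a-z]+", text.lower())
word_counts = Counter(words)

# Character tokenization: every alphabetic character
chars = [c for c in text.lower() if c.isalpha()]
char_counts = Counter(chars)

# Character n-gram tokenization (n = 3); "_" marks the end of a word
def char_ngrams(tokens, n=3):
    for tok in tokens:
        padded = tok + "_"
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

ngram_counts = Counter(char_ngrams(words, n=3))

print(word_counts.most_common(5))
print(char_counts.most_common(5))
print(ngram_counts.most_common(5))
```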
5. Text processing pipeline: Tokenization → Stopword removal
Stopword removal (SR) is a key step in text processing.
The distribution of words is not statistically uniform: Luhn's observation / Zipf's law.

Stopword removal example. Tokenized input text:
"polyp distal sigma tubulovilloso adenoma with severe dysplasia carcinoma in intramucosal minor fragment in increased fragment aspects are severe glandular epithelial dysplasia"
6. Text processing pipeline: Tokenization → Stopword removal → Stemming
Stemming (S) reduces words to their root.
Rule-based stemmers for languages with rich linguistic resources (e.g., English): Porter, Lovins, Paice.
Statistical stemmers for languages with scarce linguistic resources (e.g., Hindi): FBC, GRASS, SNS, YASS.
N-grams are an alternative to stemming.

Stemming example. Input text (after stopword removal):
"polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia"
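A minimal sketch of rule-based stemming using NLTK's Porter stemmer (a real NLTK class); the token list is the example above:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
tokens = ("polyp distal sigma tubulovilloso adenoma severe dysplasia "
          "carcinoma intramucosal minor fragment increased fragment "
          "aspects severe glandular epithelial dysplasia").split()

stems = [stemmer.stem(t) for t in tokens]
print(stems)  # e.g., "aspects" -> "aspect", "increased" -> "increas"
```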
7. Text processing pipeline: Tokenization → Stopword removal → Stemming → Part-of-Speech tagging
Words can be grouped into classes referred to as Parts of Speech (PoS) or morphological classes.
A word's PoS provides crucial information to determine the role of the word itself and of the words close to it in the sentence.
The four largest open classes of words, present in most languages, are:
- nouns
- verbs
- adverbs
- adjectives
Approaches: rule-based POS taggers, probabilistic approaches (HMM).
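A minimal sketch of PoS tagging with NLTK's off-the-shelf tagger. The sentence is illustrative, and the exact names of the downloadable tagger models may differ slightly across NLTK versions:

```python
import nltk  # pip install nltk

# One-time download of the default English tagger model.
nltk.download("averaged_perceptron_tagger")

tokens = "Colon biopsy spots scar of previous polypectomy".split()
print(nltk.pos_tag(tokens))
# A list of (token, tag) pairs, e.g. ('biopsy', 'NN') for a singular noun.
```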
8. Text processing pipeline: Tokenization → Stopword removal → Stemming → Part-of-Speech tagging → Named Entity Recognition
Named entity recognition (NER), also called entity identification or entity extraction, is an information extraction technique that automatically identifies named entities in a text and classifies them into predefined categories.
Approaches:
- Lexicon approach
- Rule-based systems
- Machine learning-based systems
- Hybrid approach

NER example. Input text (after stopword removal):
"polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia"
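A minimal sketch of the lexicon approach: match tokens against a small dictionary of concepts. The lexicon and category names below are made up for illustration, not a real medical terminology:

```python
# Illustrative lexicon mapping surface forms to entity categories.
LEXICON = {
    "adenoma": "DIAGNOSIS",
    "dysplasia": "DIAGNOSIS",
    "carcinoma": "DIAGNOSIS",
    "polyp": "FINDING",
    "sigma": "ANATOMY",
}

tokens = ("polyp distal sigma tubulovilloso adenoma severe dysplasia "
          "carcinoma intramucosal minor fragment increased fragment "
          "aspects severe glandular epithelial dysplasia").split()

entities = [(t, LEXICON[t]) for t in tokens if t in LEXICON]
print(entities)
# [('polyp', 'FINDING'), ('sigma', 'ANATOMY'), ('adenoma', 'DIAGNOSIS'), ...]
```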
9. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
10. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
First step: Tokenization (T)
11. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
Tokenized text:
<Colon, biopsy, spots, scar, of, previous, polypectomy, and, colonic, mucosa, fragment, with, fibrosis, no, evidence, inflammation, dysplasia, or, malignancy>
Next step: Stopword removal (SR)
Stopword removal might be problematic: here we would lose the negation ("no evidence"). A similar problem might occur with stemming.
12. Text processing pipeline
Let's consider a short medical report about a colon biopsy:
"Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy."
Next step: Part-of-Speech tagging (POS)
15. Term representation
Vector representations are central in many applications (e.g., IR and machine learning).
Usually, we focus on terms as the smallest unit of representation, but we may also consider n-grams.
Different representations lead to different notions of similarity, and to different properties of "compositionality" used to build passage or document representations.
16. Local representation of terms
Local (one-hot vector) representation:
- There is a fixed vocabulary V
- The size of the vectors is |V|
- 1 means the term is present, 0 means the term is absent

Sentence: "banana is a fruit"
Term indices: banana = 0, is = 1, a = 2, fruit = 3
banana: [1 0 0 0]
is:     [0 1 0 0]
a:      [0 0 1 0]
fruit:  [0 0 0 1]
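A minimal sketch of one-hot term vectors with NumPy, building the vocabulary from the sentence itself:

```python
import numpy as np

sentence = "banana is a fruit".split()
vocab = {term: idx for idx, term in enumerate(sentence)}  # fixed vocabulary V

def one_hot(term, vocab):
    """Return a |V|-dimensional vector with a 1 at the term's index."""
    v = np.zeros(len(vocab), dtype=int)
    v[vocab[term]] = 1
    return v

for term in sentence:
    print(term, one_hot(term, vocab))
```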
18. Local representation
Sentence: "banana is a fruit as well as mango; dog is an animal"
Term indices 0-4 are assigned to the content words: banana, fruit, mango, dog, animal
20. Local representation
Sentence: "banana is a fruit as well as mango; dog is an animal"
Term indices: banana = 0, fruit = 1, mango = 2, dog = 3, animal = 4
banana: [1 0 0 0 0]
fruit:  [0 1 0 0 0]
mango:  [0 0 1 0 0]
dog:    [0 0 0 1 0]
animal: [0 0 0 0 1]

We cannot use this representation to define similarity between terms:
- High dimensionality
- Sparsity
- With TF-IDF weighting the problem does not change
- We are ignoring context
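A short check of why one-hot vectors cannot express term similarity: the cosine similarity between any two distinct one-hot vectors is always 0 (a sketch with NumPy):

```python
import numpy as np

terms = ["banana", "fruit", "mango", "dog", "animal"]
vectors = np.eye(len(terms))  # one one-hot row per term

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# banana vs. mango and banana vs. dog are equally (dis)similar: both 0.0
print(cosine(vectors[0], vectors[2]))  # banana vs. mango -> 0.0
print(cosine(vectors[0], vectors[3]))  # banana vs. dog   -> 0.0
```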
21. Distributed representation
Under distributed representations, every term is represented by a vector: either a vector of hand-crafted features or a learnt representation in which the individual dimensions are not interpretable in isolation.
They use an implicit notion of similarity between words: "banana" is more similar to "mango" than to "dog" because they are both fruits, yet different because of other properties that are not shared between the two, such as shape.
22. Distributed representation
Distributional Hypothesis: terms that are used (or occur) in similar contexts tend to be semantically similar [1].
Distributional Semantics: a word is characterised by the company it keeps [2].
[1] Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
[2] John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. (1957).
23. Distributed representation (with context)
Feature vectors for "banana" under different notions of context:

In-document representation [1]:
banana -> [1 0 1 0 0 0 0 1 1 0 0 0 0 … 0] over documents [d0 - d2 - - - - d7 d8 - - - - … -]

Neighbouring-word features [2]:
banana -> [1 1 0 0 0 0 1 0 0 0 0 1 … 0] over context words [flies fruit - - - - a - - - - like …]

Neighbouring-word features with distance:
banana -> context words [flies fruit - - - - a - - - - like …] at positions [-3 -2 - - - - +1 - - - - +3 …]

Character n-grams (e.g., 3-grams):
banana -> [0 1 1 0 0 0 1 0 1 0 0 1 …] over n-grams [- #ba na# - - - ana - nan - - ban …]

[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS 41, 6 (1990), 391–407.
[2] Geoffrey E. Hinton. 1984. Distributed representations. (1984).
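A minimal sketch of a neighbouring-word representation: count, for each term, which words appear within a small window around it. The toy corpus and window size are illustrative:

```python
from collections import Counter, defaultdict

corpus = [
    "fruit flies like a banana".split(),
    "a banana is a fruit".split(),
]
WINDOW = 2  # words to the left and right counted as context

cooc = defaultdict(Counter)
for sentence in corpus:
    for i, term in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[term][sentence[j]] += 1

print(cooc["banana"])  # context words observed near "banana" within the window
```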
24. Compositionality
A document (or a sentence) can be represented as an aggregation of distributed representations, e.g.:
- sum of the word vectors
- average of the word vectors
- …
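A minimal sketch of compositionality by summing or averaging, with toy 3-dimensional word vectors (the values are made up purely for illustration):

```python
import numpy as np

# Toy word embeddings (values are illustrative only).
embeddings = {
    "severe":    np.array([0.9, 0.1, 0.0]),
    "glandular": np.array([0.2, 0.8, 0.1]),
    "dysplasia": np.array([0.7, 0.3, 0.2]),
}

tokens = ["severe", "glandular", "dysplasia"]
vectors = np.stack([embeddings[t] for t in tokens])

doc_sum = vectors.sum(axis=0)   # sum of the word vectors
doc_avg = vectors.mean(axis=0)  # average of the word vectors
print(doc_sum, doc_avg)
```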
25. Similarity-based representations
(thanks to https://www.shanelynn.ie)
We build models of meaning focusing on similarity.
Each word is a vector; similar words are "nearby in space".
We define a word as a vector called an "embedding" because it is embedded into a space.
This is the standard way to represent meaning in NLP.
26. How do we define similarity?
Bhaskar Mitra, Nick Craswell: Neural Models for Information Retrieval. CoRR abs/1705.01509 (2017)
27. How do we define similarity?
Typicality (paradigmatic): neighbouring-word features with distance.
Topicality (syntagmatic): in-document representation.
29. Term embeddings
Also called: (word) embeddings, distributional semantic model, semantic vector space.
Slides inspired by Dan Jurafsky and James Martin, Speech and Language Processing (Stanford).
https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
30. Embeddings
An embedding is a representation of items in a new space such that the properties of the items are respected.
For terms we can have sparse/explicit or dense/implicit representations. Dense/implicit representations based on feature-predicting models are:
- easier to use as features in machine learning (fewer weights to tune)
- often better-performing than embeddings based on explicit counting (dense vectors may generalize better)
- better at capturing synonymy
- easy to visualise
- but hard to interpret, if it is even possible
34. Word2Vec
Word2Vec is a popular embedding method and is very fast to train.
Idea: predict rather than count. Instead of counting how often each word w2 occurs near w1, train a classifier on a binary prediction task: is w2 likely to show up near w1?
We don't actually care about this task itself, but we take the learned classifier weights as the word embeddings.
Word2Vec (W2V) comes in two versions, implementing two different learning architectures:
1) Skip-Gram
2) Continuous Bag-Of-Words (CBOW)
36. Word2Vec: Skip-Gram
Skip-gram algorithm:
1. Treat the target word and a neighbouring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the weights as the embeddings.

In statistics, the logistic model (or logit model) is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
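A minimal sketch of training skip-gram embeddings with the gensim library. The toy corpus is illustrative; `sg=1` selects skip-gram and `negative=5` enables negative sampling (both real gensim parameters):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus: each document is a list of tokens (illustrative).
corpus = [
    "polyp distal sigma adenoma severe dysplasia".split(),
    "colon biopsy scar previous polypectomy colonic mucosa fibrosis".split(),
    "no evidence inflammation dysplasia malignancy".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimensionality
    window=2,         # context window size
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive example
    epochs=50,
)

print(model.wv["dysplasia"][:5])           # first 5 dimensions of the vector
print(model.wv.most_similar("dysplasia"))  # nearest neighbours in the space
```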
38. fastText (Facebook)
FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.
FastText assumes a word to be formed by character n-grams: for example, "sunny" is composed of [sun, sunn, sunny], [sunny, unny, nny], etc., where n could range from 1 to the length of the word.
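A minimal sketch of the character n-gram decomposition described above. The boundary handling is simplified; the real FastText implementation adds `<` and `>` word-boundary markers and hashes the n-grams:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of the word with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, min(max_n, len(word)) + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

print(char_ngrams("sunny"))
# ['sun', 'unn', 'nny', 'sunn', 'unny', 'sunny']
```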
39. fastText
This word representation used by fastText provides the following benefits over word2vec or GloVe:
- Rare words: it allows us to find a vector representation for rare words. Since rare words can still be broken into character n-grams, they share these n-grams with common words.
- OOV: it can give vector representations for words not present in the dictionary (OOV words), since these can also be broken down into character n-grams.
- Character n-gram embeddings tend to perform better than word2vec and GloVe on smaller datasets.
https://www.analyticsvidhya.com/blog/2017/07/word-representations-text-classification-using-fasttext-nlp-facebook/
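A minimal sketch of the OOV benefit using gensim's FastText implementation on a toy corpus. Because vectors are composed from character n-grams, a word absent from training (here the hypothetical plural "dysplasias") still receives a vector:

```python
from gensim.models import FastText  # pip install gensim

corpus = [
    "severe glandular epithelial dysplasia".split(),
    "adenoma with severe dysplasia".split(),
]

model = FastText(
    sentences=corpus,
    vector_size=50,
    window=2,
    min_count=1,
    min_n=3,   # smallest character n-gram
    max_n=6,   # largest character n-gram
    epochs=50,
)

# "dysplasias" never appears in the corpus, but its character n-grams do,
# so FastText can still compose a vector for it.
print(model.wv["dysplasias"][:5])
print(model.wv.similarity("dysplasia", "dysplasias"))
```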
41. Caveats: Training and Compositionality
Training dense vectors can be very expensive (time and resources).
A simple solution: use pre-trained vectors.
- fastText: https://fasttext.cc/docs/en/unsupervised-tutorial.html
- Word2Vec: https://code.google.com/archive/p/word2vec/
Once we have the word embeddings, we can build a document (paragraph, sentence) embedding by putting the word embeddings together:
- average the vectors
- sum the vectors
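A minimal sketch of this pre-trained route with gensim: load pre-trained vectors in word2vec text format and average them into a document vector. The file name is a placeholder for whichever pre-trained model is downloaded from the links above:

```python
import numpy as np
from gensim.models import KeyedVectors  # pip install gensim

# Placeholder path: point this at a downloaded pre-trained model
# (e.g., a fastText .vec file or word2vec vectors in text format).
vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.vec", binary=False)

def document_embedding(tokens, kv):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [kv[t] for t in tokens if t in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

doc = "colon biopsy with fibrosis no evidence of malignancy".split()
print(document_embedding(doc, vectors)[:5])
```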
42. Medical Word Embeddings
Word embeddings often ignore the internal structure of words and external knowledge.
Knowledge bases and ontologies can be really important in specialised domains (like the biomedical one).
There are specific word embeddings for the biomedical domain, e.g. BioWordVec (Scientific Data, 2019) or BioBERT (Lee et al., 2019).
43. Medical Word Embeddings
A crucial success factor is the data on which the embeddings are trained.
There is no one-size-fits-all solution, and general neural network-based solutions generally do not outperform traditional solutions.
Nevertheless, there is wide availability of biomedical training data, and neural networks are groundbreaking for text processing (and related tasks).