Gianmaria Silvello
Department of Information Engineering
University of Padua
gianmaria.silvello@unipd.it
http://www.dei.unipd.it/~silvello/
@giansilv
Intelligent Interactive Information Access Hub

Text Processing and Searching in the Medical Domain
Outline

Introduction to Text Processing
Text processing typical pipeline
Word Embeddings
Text Processing
Image from: https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958
Text processing pipeline
Tokenization (T)

Tokenization is a way of separating a piece of text into smaller units called tokens.

Word tokenization
Pros: very common; allows the use of pre-trained embeddings
Cons: out-of-vocabulary (OOV) terms; large vocabulary

Char tokenization
Pros: no OOV; limited vocabulary size
Cons: hard to capture the relation between chars and word meaning

N-gram tokenization
Pros: no OOV; limited vocabulary size
Cons: computationally more expensive than word tokenization
Tokenization (T): the same medical text tokenized in three ways

Input: polyp distal sigma: tubulovilloso adenoma with severe dysplasia/carcinoma in intramucosal minor fragment. in increased fragment aspects are severe glandular epithelial dysplasia.

Word tokens (counts): polyp: 1, distal: 1, sigma: 1, tubulovilloso: 1, adenoma: 1, with: 1, severe: 2, dysplasia: 1, carcinoma: 1, in: 2, intramucosal: 1, minor: 1, fragment: 2, increased: 1, aspects: 1, …

Char tokens (counts): p: 6, o: 8, l: 12, y: 3, d: 6, i: 14, …

Char n-gram tokens (3-grams, counts): pol: 2, oly: 2, lyp: 2, yp_: 2, …
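A minimal Python sketch of the three tokenizations above (the use of collections.Counter and the exact splitting rules are assumptions; counts may differ slightly from the slide depending on those choices):

```python
from collections import Counter
import re

text = ("polyp distal sigma: tubulovilloso adenoma with severe "
        "dysplasia/carcinoma in intramucosal minor fragment. in increased "
        "fragment aspects are severe glandular epithelial dysplasia.")

# Word tokenization: split on any run of non-letter characters
words = [w for w in re.split(r"[^a-z]+", text.lower()) if w]
print(Counter(words))   # e.g. fragment: 2, severe: 2, ...

# Char tokenization: every (alphabetic) character is a token
print(Counter(c for c in text if c.isalpha()))

# Char n-gram tokenization (3-grams), with '_' marking word boundaries
padded = "_" + "_".join(words) + "_"
print(Counter(padded[i:i + 3] for i in range(len(padded) - 2)))
```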
Text processing pipeline
Tokenization (T) → Stopwords removal (SR)

Stopwords removal is a key step in text processing.

The distribution of words is not statistically uniform: Luhn's observation / Zipf's law.

Stopwords (S): example

polyp distal sigma tubulovilloso adenoma with severe dysplasia carcinoma in intramucosal minor fragment in increased fragment aspects are severe glandular epithelial dysplasia
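A minimal sketch of stopword removal (the stopword list below is illustrative; in practice one would use a curated list such as NLTK's nltk.corpus.stopwords):

```python
# Illustrative stopword list, not the one used in the slides.
STOPWORDS = {"with", "in", "are", "of", "and", "or", "the", "a", "is", "no"}

tokens = ("polyp distal sigma tubulovilloso adenoma with severe dysplasia "
          "carcinoma in intramucosal minor fragment in increased fragment "
          "aspects are severe glandular epithelial dysplasia").split()

filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)
# ['polyp', 'distal', 'sigma', 'tubulovilloso', 'adenoma', 'severe', ...]
```

Note that a generic list containing "no" would also discard negations, the problem the worked example later in the deck highlights.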
Text processing pipeline
Tokenization (T) → Stopwords removal (SR) → Stemming (S)

Stemming reduces words to their root.

Rule-based stemmers for languages with rich linguistic resources (e.g., English):
Porter, Lovins, Paice

Statistical stemmers for languages with scarce linguistic resources (e.g., Hindi):
FBC, GRASS, SNS, YASS

N-grams are an alternative to stemming.

Stemming (S): example

polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia
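A minimal sketch of rule-based stemming with NLTK's implementation of the Porter stemmer (the token list is illustrative):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
tokens = ["increased", "aspects", "glandular", "epithelial", "dysplasia"]
print([stemmer.stem(t) for t in tokens])
# e.g. 'increased' -> 'increas', 'aspects' -> 'aspect'
```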
Text processing pipeline
Tokenization (T) → Stopwords removal (SR) → Stemming (S) → Part-of-Speech tagging (POS)

Words can be grouped into classes referred to as Parts of Speech (PoS) or morphological classes.

The PoS of a word provides crucial information to determine the role of the word itself and of the words close to it in the sentence.

The 4 largest open classes of words, present in most languages, are:
▫ nouns
▫ verbs
▫ adverbs
▫ adjectives

Approaches: rule-based POS tagging; probabilistic approaches (HMM).
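A minimal PoS-tagging sketch with NLTK's off-the-shelf perceptron tagger (an assumption: the slides do not prescribe a tool, and the nltk.download resource names vary across NLTK versions):

```python
import nltk
# One-time downloads (names may vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Colon biopsy spots scar of previous polypectomy.")
print(nltk.pos_tag(tokens))
# e.g. [('Colon', 'NNP'), ('biopsy', 'NN'), ('spots', 'VBZ'), ...]
```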
Text processing pipeline
Tokenization (T) → Stopwords removal (SR) → Stemming (S) → Part-of-Speech tagging (POS) → Named Entity Recognition (NER)

Named entity recognition (NER) ‒ also called entity identification or entity extraction ‒ is an information extraction technique that automatically identifies named entities in a text and classifies them into predefined categories.

Lexicon approach
Rule-based systems
Machine learning-based systems
Hybrid approach

Named Entity Recognition (NER): example

polyp distal sigma tubulovilloso adenoma severe dysplasia carcinoma intramucosal minor fragment increased fragment aspects severe glandular epithelial dysplasia
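A minimal NER sketch with spaCy's general-purpose English model (an assumption: the slides do not name a system; for pathology reports a biomedical model such as scispaCy's would be more appropriate):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Colon biopsy spots scar of previous polypectomy and colonic "
          "mucosa fragment with fibrosis.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # entity span and predicted category
```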
Text processing pipeline: a worked example

Let's consider a short medical report about a colon biopsy:

Colon biopsy spots scar of previous polypectomy and colonic mucosa fragment with fibrosis. no evidence inflammation, dysplasia or malignancy.

Tokenization (T):
<Colon, biopsy, spots, scar, of, previous, polypectomy, and, colonic, mucosa, fragment, with, fibrosis, no, evidence, inflammation, dysplasia, or, malignancy>

Stopwords removal (SR): stopword removal might be problematic. Here we would lose the negation ("no"). A similar problem might occur with stemming.

Part-of-Speech tagging (POS) and Named Entity Recognition (NER) are then applied to the same report.
Behind the pipeline…
Term representation

Vector representations are central in many applications (e.g., IR and machine learning).

Usually, we focus on terms as the smallest unit of representation…

…but we may also consider n-grams.

Different representations lead to different notions of similarity…

…and different properties of "compositionality" to build passage or document representations.
Local representation of terms

Local (one-hot vector) representation:
There is a fixed vocabulary V.
The size of the vectors is |V|.
1 means the term is present, 0 means the term is not there.

Sentence:   banana is a fruit
Term index: 0 1 2 3

banana [1 0 0 0]
is     [0 1 0 0]
a      [0 0 1 0]
fruit  [0 0 0 1]
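A minimal one-hot sketch with NumPy, showing why this representation cannot express term similarity:

```python
import numpy as np

vocab = {"banana": 0, "is": 1, "a": 2, "fruit": 3}

def one_hot(term: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

# Distinct one-hot vectors are orthogonal: the dot product (and hence the
# cosine similarity) is 0 for every pair of different terms.
print(one_hot("banana") @ one_hot("fruit"))  # 0.0
```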
Local representation

Sentence: banana is a fruit as well as mango; dog is an animal
Term index: 0 1 2 3 4

banana [1 0 0 0 0]
fruit  [0 1 0 0 0]
mango  [0 0 1 0 0]
dog    [0 0 0 1 0]
animal [0 0 0 0 1]

We cannot use this representation to define similarity between terms:
High-dimensionality
Sparse
With TF-IDF weighting the problem does not change
We are ignoring context
Distributed representation

Under distributed representations every term is represented by a vector:
a vector of hand-crafted features, or a learnt representation in which the individual dimensions are not interpretable in isolation.

Use an implicit notion of similarity between words:
"banana" is more similar to "mango" than to "dog" because they are both fruits, yet different because of other properties that are not shared between the two, such as shape.
Distributed representation

Distributional Hypothesis: terms that are used (or occur) in similar contexts tend to be semantically similar [1].

Distributional Semantics: a word is characterised by the company it keeps [2].

[1] Zellig S. Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
[2] John R. Firth. 1957. A synopsis of linguistic theory, 1930-1955. (1957).
Distributed representation (with context)

Feature vectors for "banana" under different notions of context:

in-document representation [1]:
features: [d0  -  d2  -  -  -  -  d7  d8  -  -  -  -  …  -]
vector:   [1   0  1   0  0  0  0  1   1   0  0  0  0  …  0]

neighbouring-word features:
features: [flies  fruit  -  -  -  -  a  -  -  -  -  like  …]
vector:   [1      1      0  0  0  0  1  0  0  0  0  1     …  0]

neighbouring-word w/ distance features:
features: [flies  fruit  -  -  -  -  a   -  -  -  -  like  …]
distance: [-3     -2     -  -  -  -  +1  -  -  -  -  +3    …]

character n-grams (e.g., 3-grams):
features: [-  #ba  na#  -  -  -  ana  -  nan  -  -  ban  …]
vector:   [0  1    1    0  0  0  1    0  1    0  0  1    …]

[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS 41, 6 (1990), 391–407.
[2] Geoffrey E. Hinton. 1984. Distributed representations. (1984).
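A minimal sketch of the neighbouring-word features above, built as window-based co-occurrence counts (the two-sentence corpus is illustrative):

```python
from collections import defaultdict

corpus = ["time flies like a banana", "a banana is a fruit"]
window = 2
cooc = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Count every word within `window` positions of w
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

print(dict(cooc["banana"]))  # {'like': 1, 'a': 3, 'is': 1}
```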
Compositionality

A document (or a sentence) can be represented as an aggregation of distributed representations:
sum of the word vectors
average of the word vectors
…
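A minimal sketch of both aggregations (the random embedding table is an illustrative stand-in for a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50)
              for w in ["polyp", "distal", "adenoma", "severe", "dysplasia"]}

def doc_vector(tokens, emb, how="mean"):
    vecs = [emb[t] for t in tokens if t in emb]   # skip OOV tokens
    if not vecs:
        return np.zeros(50)
    return np.sum(vecs, axis=0) if how == "sum" else np.mean(vecs, axis=0)

print(doc_vector(["polyp", "distal", "adenoma"], embeddings).shape)  # (50,)
```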
Similarity-based representations
thanks to https://www.shanelynn.ie

We build models of meaning focusing on similarity.

Each word = a vector.

Similar words are "nearby in space".

We define a word as a vector, called an "embedding" because it's embedded into a space.

This is the standard way to represent meaning in NLP.
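"Nearby in space" is typically measured with cosine similarity; a minimal sketch (the 3-d vectors are illustrative, not trained embeddings):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

banana = np.array([0.9, 0.1, 0.0])
mango  = np.array([0.8, 0.2, 0.1])
dog    = np.array([0.0, 0.9, 0.4])
print(cosine(banana, mango))  # high: similar words, nearby in space
print(cosine(banana, dog))    # lower: dissimilar words
```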
How do we define similarity?
Bhaskar Mitra, Nick Craswell: Neural Models for Information Retrieval. CoRR abs/1705.01509 (2017)

Typicality (paradigmatic): neighbouring-word w/ distance features
Topicality (syntagmatic): in-document representation
Term embeddings
(also known as: distributional semantic model, semantic vector space)
https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
Slides inspired by Dan Jurafsky and James Martin. Speech and Language Processing (Stanford)

(Word) Embeddings
An embedding is a representation of items in a new space such that the properties of the items are respected.

For terms we can have sparse/explicit or dense/implicit representations. Dense/implicit representations based on feature-predicting models are:
easier to use as features in machine learning (fewer weights to tune)
often better-performing than embeddings based on explicit counting (dense vectors may generalize better)
better at capturing synonymy
easy to visualise
but hard to interpret, if interpretation is possible at all
Dense embeddings (code and vectors)

Word2Vec: https://code.google.com/archive/p/word2vec/
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

FastText: http://www.fasttext.cc/
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov. 2016. FastText.zip: Compressing text classification models.

GloVe: http://nlp.stanford.edu/projects/glove/
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/pubs/glove.pdf
Dense embeddings (frameworks)

Distributional semantics (pyDSM): https://github.com/jimmycallin/pydsm
Semantic vectors (Java): https://github.com/semanticvectors/semanticvectors
Semantic spaces (Java): https://github.com/fozziethebeat/S-Space
Semantic spaces (Python): http://clic.cimec.unitn.it/composes/toolkit/
Deep Learning 4 Java: https://deeplearning4j.org/ (learn about Apache Maven first)
Word2Vec

Popular embedding method
Very fast to train
Idea: predict rather than count

Instead of counting how often each word w2 occurs near w1, train a classifier on a binary prediction task: is w2 likely to show up near w1?

We don't actually care about this task, but we'll take the learned classifier weights as the word embeddings.
Word2Vec (W2V) comes in two versions implementing two different learning architectures:
1) Skip-Gram
2) Continuous Bag-Of-Words (CBOW)
Word2Vec: Skip-Gram

Skip-gram algorithm:
1. Treat the target word and a neighboring context word as positive examples
2. Randomly sample other words in the lexicon to get negative samples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the weights as the embeddings
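For reference, the negative-sampling objective from the Mikolov et al. (2013) paper cited earlier, which step 3's logistic classifier maximises for a target word $w_I$ and an observed context word $w_O$ (a standard formulation, not spelled out on the slide):

```latex
\log \sigma\left(v'_{w_O} \cdot v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\left(-v'_{w_i} \cdot v_{w_I}\right) \right]
```

where σ is the logistic (sigmoid) function, k is the number of negative samples drawn from a noise distribution P_n(w), and v, v' are the input and output embedding tables; the rows of v are taken as the word embeddings (step 4).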
In statistics, the logistic model (or logit model) is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
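A minimal training sketch with Gensim's Word2Vec implementation (an assumption: the slides reference Google's original code, not a specific API; the two-report corpus is an illustrative stand-in for a collection of pathology reports):

```python
from gensim.models import Word2Vec

reports = [
    ["polyp", "distal", "sigma", "adenoma", "severe", "dysplasia"],
    ["colon", "biopsy", "scar", "polypectomy", "mucosa", "fibrosis"],
]

model = Word2Vec(
    sentences=reports,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
    min_count=1,
)
print(model.wv["dysplasia"].shape)         # (100,)
print(model.wv.most_similar("dysplasia"))  # nearest terms in the space
```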
Word2Vec in Digital Pathology
fasttext (Facebook)

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.

FastText assumes a word to be formed by character n-grams:
for example, sunny is composed of [sun, sunn, sunny], [sunny, unny, nny], etc.,
where n can range from 1 to the length of the word.
fasttext

This representation of words gives fastText the following benefits over word2vec or GloVe:

Rare words: it allows us to find vector representations for rare words, since rare words can still be broken into character n-grams and share those n-grams with common words.

OOV: it can give vector representations for words not present in the dictionary (OOV words), since these too can be broken down into character n-grams.

Character n-gram embeddings tend to outperform word2vec and GloVe on smaller datasets.

https://www.analyticsvidhya.com/blog/2017/07/word-representations-text-classification-using-fasttext-nlp-facebook/
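A minimal sketch of FastText subword embeddings via Gensim (an assumption: the slides reference Facebook's library, not a specific API; the corpus is the same illustrative stand-in as above):

```python
from gensim.models import FastText

reports = [
    ["polyp", "distal", "sigma", "adenoma", "severe", "dysplasia"],
    ["colon", "biopsy", "scar", "polypectomy", "mucosa", "fibrosis"],
]

model = FastText(sentences=reports, vector_size=100, window=5,
                 min_count=1, min_n=3, max_n=6)  # 3- to 6-char n-grams

# An out-of-vocabulary word still gets a vector from the character
# n-grams it shares with in-vocabulary words:
print(model.wv["polyps"].shape)  # (100,) even though "polyps" was unseen
```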
FastText in Digital Pathology
Caveats: Training and Compositionality

Training dense vectors can be very expensive (time and resources).

A simple solution: use pre-trained vectors
FastText: https://fasttext.cc/docs/en/unsupervised-tutorial.html
Word2Vec: https://code.google.com/archive/p/word2vec/

Once we have the word embeddings we can build a document (paragraph, sentence) embedding by putting the word embeddings together:
Average the vectors
Sum the vectors
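A minimal sketch loading pre-trained vectors with Gensim's downloader and averaging them into a document embedding (the model name is one of Gensim's hosted datasets, chosen for its small size; the FastText/Word2Vec downloads linked above work similarly):

```python
import gensim.downloader as api
import numpy as np

wv = api.load("glove-wiki-gigaword-50")  # small pre-trained vectors

tokens = ["colon", "biopsy", "dysplasia"]
# Average the vectors of in-vocabulary tokens into a document embedding
doc = np.mean([wv[t] for t in tokens if t in wv], axis=0)
print(doc.shape)  # (50,)
```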
Medical Word Embeddings

Word embeddings often ignore the internal structure of words and external knowledge.

Knowledge bases and ontologies can be really important in specialised domains (like the biomedical one).

There are specific word embeddings for the biomedical domain, e.g. BioWordVec (Scientific Data, 2019) or BioBERT (Lee et al., 2019).
Medical Word Embeddings

A crucial success factor is where the embeddings are trained.

There is no one-size-fits-all solution, and general neural network-based solutions do not necessarily outperform traditional ones.

Nevertheless, there is a wide availability of biomedical training data, and neural networks are groundbreaking for text processing (and related tasks).
