A machine learning approach to clinical terms normalization

H. Berinsky, J. Castaño, M. Gambarte, D. Perez,
H. Park, M. Ávila Williams, F. Campos,
D. Luna, S. Benitez, S. Zanetti
Depto. de Informática en Salud, Hospital Italiano de Buenos Aires
hernan.berinsky@hospitalitaliano.org.ar
Depto. de Computación, FCEyN, Universidad de Buenos Aires
jcastano@dc.uba.ar
August 12, 2016

Context
Terminology Services
SNOMED-CT as reference terminology
HIBA terminology
Interface vocabulary

Interface vocabulary
Objective
Semantic recognition of clinical term descriptions
Problems (domain)
Clinical findings, family history, suspected disease
Lexical variability and noise
Descriptions contain acronyms, abbreviations, typos, irrelevant data
Difficult to develop a rule-based approach due to the 'long-tail' nature of the problem
String matching
Drawbacks of approximate string matching (fuzzy string matching), e.g. Levenshtein or Jaccard, in the clinical domain. Example pairs:
sospecha de laringitis alérgica | sospecha de faringitis alérgica
sospecha de laringitis alérgica | probable laringitis alérgica
antec fliar de madre con hipotiroidismo | antec fliar de padre con hipertiroidismo
antecedente familiar de madre con hipotiroidismo | madre con hipertiroidismo
embarazo 7 semanas | embarazo 20 semanas
fractura de cadera ayer al mediodía | fractura de cadera hace 2 semanas
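The drawback can be illustrated with a minimal sketch. It assumes character-level Levenshtein similarity normalized by the longer string and Jaccard over whitespace-separated words; the slides do not state which variants were used, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_words(a: str, b: str) -> float:
    """Jaccard coefficient over the sets of whitespace-separated words."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return 1.0 if not union else len(wa & wb) / len(union)


query = "sospecha de laringitis alérgica"
other_condition = "sospecha de faringitis alérgica"  # pharyngitis, not laryngitis
same_condition = "probable laringitis alérgica"      # different wording only

for candidate in (other_condition, same_condition):
    print(candidate,
          round(levenshtein_similarity(query, candidate), 3),
          round(jaccard_words(query, candidate), 3))
```

Under both measures the pair that differs in meaning (laringitis vs. faringitis) scores higher than the pair that differs only in wording (sospecha de vs. probable).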

Soft-TF-IDF Information retrieval approach (baseline)
Build n-gram inverted index (tolerant retrieval)
Vectors of character bigrams with TF-IDF weighting scheme (ltc.nnc)
Classification rule: the top result is a match if its score ≥ t, where t is a threshold
Validation: corpus + queries (partition)
Metrics: precision(t), recall(t), F1(t)
Precision-recall trade-off controlled by t. As t grows:
precision(t) tends to increase
recall(t) decreases
Evaluation (results): F1 = 0.74
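The threshold-dependent metrics can be sketched as follows. This assumes each query contributes one (score, is_correct) pair for its top result, and that a correct top result rejected by the threshold counts as a false negative. The scores below are made up for illustration and are not the evaluation data.

```python
def precision_recall_f1(results, t):
    """results: list of (score, is_correct) pairs, one per query (top result only)."""
    tp = sum(1 for score, ok in results if score >= t and ok)
    fp = sum(1 for score, ok in results if score >= t and not ok)
    fn = sum(1 for score, ok in results if score < t and ok)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical top-1 results: (score, whether the top result is the right concept)
toy_results = [(0.95, False), (0.90, True), (0.80, True),
               (0.60, True), (0.44, True), (0.30, False)]

for t in (0.2, 0.5):
    print(t, precision_recall_f1(toy_results, t))
```

On this toy data, raising t from 0.2 to 0.5 trades recall for precision.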
Query: sosp faringitis alergica
Results
description | score
sosp laringitis alérgica | 0.95
sospecha faringitis alérgica | 0.71
False positive
Query: antec fliar de ca pulmonar padre biolog
Results (not found)
description | score
antecedente familiar de neoplasia maligna de pulmón en padre natural | 0.44
False negative
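A minimal sketch of character-bigram TF-IDF ranking with a score threshold, reading ltc.nnc in the usual SMART notation (documents: log TF, IDF, cosine normalization; query: raw TF, no IDF, cosine normalization). It scans all descriptions instead of using an inverted index, it does not implement the "soft" token matching of Soft-TF-IDF, and the padding with spaces and the function names are assumptions.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Character bigrams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def normalize(vec):
    """Scale a sparse vector to unit length (empty dict if the norm is zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def build_doc_vectors(descriptions):
    """ltc weighting: (1 + log tf) * log(N / df), then cosine normalization."""
    tfs = [Counter(char_bigrams(d)) for d in descriptions]
    df = Counter(g for tf in tfs for g in tf)
    n = len(descriptions)
    return [normalize({g: (1 + math.log(f)) * math.log(n / df[g]) for g, f in tf.items()})
            for tf in tfs]


def search(query, descriptions, doc_vectors, t):
    """Return (best description or None if its score is below t, best score)."""
    q = normalize(dict(Counter(char_bigrams(query))))  # nnc weighting
    scored = [(sum(w * dv.get(g, 0.0) for g, w in q.items()), d)
              for d, dv in zip(descriptions, doc_vectors)]
    score, best = max(scored)
    return (best, score) if score >= t else (None, score)


descriptions = ["sosp laringitis alérgica",
                "sospecha faringitis alérgica",
                "fractura de cadera"]
doc_vectors = build_doc_vectors(descriptions)
print(search("sosp faringitis alergica", descriptions, doc_vectors, t=0.5))
```

The scores produced by this toy index will not reproduce the values shown on the slide, which come from the full corpus.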

Machine learning approach
Logic/rule-based approach (knowledge engineering perspective)
Difficult to encode system semantics, noise, ambiguity and errors; does not scale up
Machine learning approach
Learn to match clinical term descriptions from known valid/invalid matchings. Steps:
Dataset construction
Features generation
Training (MaxEnt, XGBoost *)
Evaluation
XGBoost *
Gradient boosting
Ensemble of trees (weak learners)
Additive training: iteratively add the tree that most improves the model
Regularization: tree complexity, shrinkage, stochastic gradient boosting (bagging)
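The additive training idea can be sketched with a toy gradient booster. This is not XGBoost: it assumes squared loss, a single numeric feature, and depth-1 trees (stumps), and it only shows shrinkage as regularization. All names are illustrative.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: (threshold, left value, right value)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # no valid split: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict(base, stumps, shrinkage, x):
    """Base prediction plus the shrunken sum of all stump outputs."""
    return base + shrinkage * sum(l if x <= thr else r for thr, l, r in stumps)


def boost(xs, ys, rounds=50, shrinkage=0.3):
    """Each round fits a stump to the current residuals and adds it to the ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict(base, stumps, shrinkage, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


# Toy data: one similarity-like feature and a binary target
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
base, stumps = boost(xs, ys)
print(round(predict(base, stumps, 0.3, 0.2), 3), round(predict(base, stumps, 0.3, 0.8), 3))
```

A smaller shrinkage value makes each added tree contribute less, so more rounds are needed but the fit is less aggressive.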

For each pair of validated descriptions {d1, d2} in the corpus, create:
a positive example if they belong to the same concept (target = 1)
a negative example if d2 is a false positive result when query is d1 (target = 0)
Corpus (example)
concept | description
sospecha de faringitis | sospecha de faringitis
sospecha de faringitis | sosp faringitis
sospecha de laringitis | sospecha de laringitis
sospecha de laringitis | sosp laringitis
sospecha de laringitis | sos laringitis
Dataset
d1 | d2 | target
sosp laringitis | sospecha laringitis | 1
sos faringitis | sosp faringitis | 1
sospecha de faringitis | sospecha de laringitis | 0
sosp de faringitis | sosp laringitis | 0
sos de faringitis | sosp laringitis | 0
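A minimal sketch of the dataset construction described above. It assumes the corpus is a mapping from concept to validated descriptions, and it uses `difflib.SequenceMatcher` only as a stand-in for the IR baseline that produces the false positives; the function names are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations

corpus = {
    "sospecha de faringitis": ["sospecha de faringitis", "sosp faringitis"],
    "sospecha de laringitis": ["sospecha de laringitis", "sosp laringitis", "sos laringitis"],
}


def top_result(query, candidates):
    """Stand-in retrieval: the candidate most similar to the query."""
    return max(candidates, key=lambda d: SequenceMatcher(None, query, d).ratio())


def build_dataset(corpus):
    concept_of = {d: c for c, ds in corpus.items() for d in ds}
    rows = []
    # Positive examples: pairs of descriptions of the same concept
    for ds in corpus.values():
        rows += [(d1, d2, 1) for d1, d2 in combinations(ds, 2)]
    # Negative examples: the top result for d1 belongs to a different concept
    for d1 in concept_of:
        others = [d for d in concept_of if d != d1]
        hit = top_result(d1, others)
        if concept_of[hit] != concept_of[d1]:
            rows.append((d1, hit, 0))
    return rows


for row in build_dataset(corpus):
    print(row)
```

Taking negatives from retrieval errors, rather than from random pairs, keeps the classifier focused on the confusable cases.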

Real data looks like...
d1 | d2 | target
antec fliar de madre con hipertiroidismo | AF de madre con hipertiridismo | 1
antec fliar de madre con hipertiroidismo | AF de madre con hipotiroidismo | 0
madre con hipotiroidismo | atc fam madre hipotirosidismo | 1
ant fam de padre con diabetes | antec familiar padre con diabetes | 1
ant familiar de padre con diabetes | antec familiar madre con diabetes | 0
antecedente fam de padre cáncer renal | AF de padre cariñón | 1
ca de piel hace 3 meses | neoplasia maligna de piel | 1
abandono madre biológica febrero 2002 | fuga del hogar de la madre natural | 1
muerte por asfixia en incendio forestal | fallec por asfixia en un incendio | 1
fractura de cadera por un accidente | fractura de cadera debido a accidente | 1

Features
S1   L1 = length(d1)
     L2 = length(d2)
     m, M = min(L1, L2), max(L1, L2)
     M − m
     m/M
     Levenshtein ratio(d1, d2)
     Jaccard(d1, d2)
S2   vector of binary (w, d1)
     vector of binary (w, d2)
S3   vector of TF-IDF (w, d1)
     vector of TF-IDF (w, d2)
S4   vector of TF (b, d1)
     vector of TF (b, d2)
S5   vector of TF-IDF (b, d1)
     vector of TF-IDF (b, d2)
S6   vector of binary (w, d12)
     vector of binary (w, d21)
S7   vector of TF-IDF (b, d12)
     vector of TF-IDF (b, d21)
S8   vector of TF (w, d12)
     vector of TF (w, d21)
     vector of TF (w, c)
S9   vector of TF (b, d12)
     vector of TF (b, d21)
     vector of TF (b, c)
S10  Word groups (w, d12)
     Word groups (w, d21)
     Word groups (w, c)
w: unigram word
b: bigram character
d12 = words(d1) \ words(d2)
d21 = words(d2) \ words(d1)
c = w(d1) ∩ w(d2)
Example (unigram word)
d1 = fractura de rodilla izquierda
d2 = fractura de rodilla izq
w(d1) = {fractura, de, rodilla, izquierda}
w(d2) = {fractura, de, rodilla, izq}
d12 = {izquierda}
d21 = {izq}
c = {fractura, de, rodilla}
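The difference and common-word sets from the example can be computed with plain set operations. This sketch assumes whitespace tokenization and character-based lengths for the S1 features (the slides do not say whether length is counted in characters or words); function names are illustrative.

```python
def word_set(description):
    """Set of whitespace-separated words in a description."""
    return set(description.split())


def pair_sets(d1, d2):
    """Return (d12, d21, c): words only in d1, words only in d2, shared words."""
    w1, w2 = word_set(d1), word_set(d2)
    return w1 - w2, w2 - w1, w1 & w2


def length_features(d1, d2):
    """Length-based part of S1, assuming length means number of characters."""
    l1, l2 = len(d1), len(d2)
    m, M = min(l1, l2), max(l1, l2)
    return {"L1": l1, "L2": l2, "M-m": M - m, "m/M": m / M if M else 1.0}


d1 = "fractura de rodilla izquierda"
d2 = "fractura de rodilla izq"
d12, d21, c = pair_sets(d1, d2)
print(d12, d21, sorted(c))
print(length_features(d1, d2))
```

Representing a pair by d12, d21 and c lets the classifier focus on what differs between the two descriptions, separately from what they share.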

Word groups (main idea)
Given positive examples in dataset
d1 | d2
antecedente familiar de madre con hipertiroidismo | AF de madre con hipertiroidismo
antecedente familiar de padre cáncer de hígado | AF de padre ca de higado
ca de piel | neoplasia maligna de piel
abandono de madre biológica | abandono de madre natural
muerte por asfixia | fallecimiento por asfixia
fractura de cadera a causa de accidente | fractura de cadera debido a accidente
Infer semantic equivalence classes
{{deceso, fallecimiento, muerte}, {biológico, natural}, {debido a, a causa de}, {cáncer, ca, neoplasia maligna}, {renal, de riñón}}
Then
The discovered knowledge allows the following 72 descriptions (among others) to be recognized as semantically equivalent:
Duelo por {deceso | fallecimiento | muerte} de padre {biológico | natural} {debido a | a causa de} {cáncer | ca | neoplasia maligna} {renal | de riñón}
(3 × 2 × 2 × 3 × 2 = 72 combinations)

Word groups
d1 | d2 | target
sospecha de dengue | probable dengue | 1
sospecha de ACV | posible ACV | 1
sosp tumor renal | probable tumor renal | 1
... | ... | ...
Semantic equivalence pairs:
C = {(sospecha, probable), (sospecha, posible), (sosp, probable)}
Semantic equivalence inference procedure
Build an undirected weighted graph G = (V, E, W), where E = {({d12, d21}, w) : (d12, d21) ∈ C, w = frequency(d12, d21)}
Remove edges in G if w < t for some threshold t
Find connected components in G
What happens with ambiguous concepts?
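The inference procedure can be sketched as follows, assuming the equivalence pairs have already been extracted from positive examples and are given as a list with repetitions (so that frequency can be counted). The function name and the toy pairs are illustrative.

```python
from collections import Counter, defaultdict


def equivalence_classes(pairs, t):
    """Connected components of the pair graph after dropping edges with weight < t."""
    # Edge weight = how often the unordered pair was observed
    weights = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    adjacency = defaultdict(set)
    for edge, w in weights.items():
        if w >= t:
            a, b = tuple(edge)
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components


# Toy pairs with made-up frequencies
pairs = ([("sospecha", "probable")] * 3 + [("sospecha", "posible")] * 2 +
         [("sosp", "probable")] * 2 + [("ca", "cáncer")] * 2 + [("ca", "cadera")])
for component in equivalence_classes(pairs, t=2):
    print(sorted(component))
```

The frequency threshold removes pairs seen only rarely, such as the spurious (ca, cadera) pair in this toy input.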

Word groups
Ambiguous cases
Ambiguous words are connected to multiple concepts, e.g. the acronym od is connected to oido derecho (right ear), ojo derecho (right eye) and ovario derecho (right ovary)
As a result, multiple concepts end up in the same connected component
Mitigation
Run a label propagation algorithm (community detection in complex networks) on such connected components
No parameter needs to be known beforehand (e.g. number of clusters)
Main idea: every vertex starts with its own label; each vertex v ∈ G with neighbors v1, ..., vk, where each vi has a label L(vi), joins the community to which the maximum number of its neighbors belong (ties broken uniformly at random)
The resulting clustering is evaluated using the modularity measure
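A minimal sketch of asynchronous label propagation on an adjacency mapping. As an assumption made to help convergence, a vertex keeps its current label when that label is already among the most frequent ones in its neighborhood; otherwise ties are broken uniformly at random. The graph, seed and function name are illustrative, and the modularity evaluation is not shown.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_sweeps=100):
    """adjacency: dict vertex -> set of neighbors (symmetric). Returns vertex -> label."""
    rng = random.Random(seed)
    labels = {v: v for v in adjacency}  # every vertex starts in its own community
    nodes = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit vertices in random order
        changed = False
        for v in nodes:
            if not adjacency[v]:
                continue
            counts = Counter(labels[u] for u in adjacency[v])
            top = max(counts.values())
            best = sorted(label for label, n in counts.items() if n == top)
            if labels[v] not in best:
                labels[v] = rng.choice(best)
                changed = True
        if not changed:  # every label is already a majority label: stop
            break
    return labels


# Toy graph: two separate triangles
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
print(label_propagation(graph))
```

Because the outcome depends on the visiting order and on random tie-breaking, different seeds can yield different communities on less clear-cut graphs.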

Results
Detected: 1,496 semantic equivalence classes with 9,152 words
Graph 9,926 items (words) (threshold t = 3)
Connected components
3,289 items (1,004 unambiguous components) + 6,637 items (ambiguous components)
Label propagation (clustering) (1st execution) (6,637 items)
5,831 items (487 unambiguous clusters) + 806 items (ambiguous clusters)
Label propagation (clustering) (2nd execution) (806 items)
32 items (5 unambiguous clusters) + 774 items (1 ambiguous cluster)
Examples
class | members
aumento | aumento, elevacion, alza, ascenso, incremento
boca | boca, bucal, bucales, oral, orales, yugal
conyugue | conyugue, conyuge, esposa, esposo, marido, pareja, novia, matrimonial, maritale
cutaneo | cutaneo, cutanea, dermatologica, dermica, dermico, piel, peil
fractura | fractura, fx, fc, fratura, fracura, fract, fr
fumador | fumador, fumadora, tabaco, tabaquismo, tabaquista, tqb
infantil | infantil, pediatrico, pedriatica
izquierda | izquierda, izq, izquierdo, izquierda, izqdo, izqda, izda
paciente | paciente, pac, pact, pacte, pte, pcte
postoperatorio | postoperatorio, postquirurgico, postqx, posqx, postop, posop, pop
quimioterapia | quimioterapia, qmt, qt, quimio, pqt
sindrome | sindrome, sme, enfermedad, sd, sind, enf, cuadro, sind, sindorme, sdme, sdr, sde
traumatismo | traumatismo, trauma, tx, trauamtismo, trumatismo, trauma, golpe, tmo

Classification experiments
Unigram word features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S2) d1, d2 | binary | 0.59 | 0.59
(S3) d1, d2 | tf-idf | 0.59 | 0.58
(S6) d12, d21 | binary | 0.63 | 0.62
(S8) d12, d21, c | binary | 0.76 | 0.62
Bigram character features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S4) d1, d2 | freq. | 0.57 | 0.76
(S5) d1, d2 | tf-idf | 0.56 | 0.74
(S7) d12, d21 | freq. | 0.58 | 0.76
(S9) d12, d21, c | freq. | 0.72 | 0.77

Classification experiments
Top-1 result
Model | Prec | Rec | F1
IR | 0.73 | 0.76 | 0.74
MaxEnt (S1) | 0.66 | 0.74 | 0.70
XGBoost (S1) | 0.65 | 0.70 | 0.68
MaxEnt (S8) | 0.74 | 0.78 | 0.76
XGBoost (S9) | 0.75 | 0.79 | 0.77
MaxEnt (S1, S8, S10) | 0.87 | 0.91 | 0.89
XGBoost (S1, S9, S10) | 0.87 | 0.91 | 0.89

Conclusions & Future work
The machine learning approach outperforms the Soft-TFIDF baseline (F1 = 0.89 vs. 0.74)
Requires neither lexical resources (acronyms, abbreviations, synonyms) nor spell checkers: this knowledge is acquired from examples
Unsupervised learning of synonyms, abbreviations and typos improves the results obtained through string similarity features
No Spanish-specific resource is used, so the approach can be replicated in any language
Possible to use query expansion techniques
