A Machine Learning Approach to Clinical Terms Normalization
H. Berinsky, J. Castaño, M. Gambarte, D. Perez, H. Park, M. Ávila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti
Depto. de Informática en Salud, Hospital Italiano de Buenos Aires
hernan.berinsky@hospitalitaliano.org.ar
Depto. de Computación, FCEyN, Universidad de Buenos Aires
jcastano@dc.uba.ar
August 12, 2016
H. Berinsky, J. Castaño, M. Gambarte, D. Perez, H. Park, M. Ávila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA), Hospital Italiano de Buenos Aires, August 12, 2016, 1 / 15
Context
Terminology Services
SNOMED-CT as reference terminology
HIBA terminology
Interface vocabulary
Interface vocabulary
Objective
Semantic recognition of clinical term descriptions
Problems (domain)
Clinical findings, family history, suspected disease
Lexical variability and noise
Descriptions contain acronyms, abbreviations, typos, irrelevant data
Difficult to develop a rule-based approach due to the 'long-tail' nature of the problem
String matching
Approximate (fuzzy) string matching, e.g. Levenshtein or Jaccard, has drawbacks in the clinical domain:
sospecha de laringitis alérgica | sospecha de faringitis alérgica
sospecha de laringitis alérgica | probable laringitis alérgica
antec fliar de madre con hipotiroidismo | antec fliar de padre con hipertiroidismo
antecedente familiar de madre con hipotiroidismo | madre con hipertiroidismo
embarazo 7 semanas | embarazo 20 semanas
fractura de cadera ayer al mediodía | fractura de cadera hace 2 semanas
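As a rough illustration of this drawback, the sketch below computes a character-level Levenshtein distance and a word-level Jaccard similarity for the first two pairs listed above. It assumes (a reading of the slide, not something stated on it) that the first pair names two different conditions while the second pair is equivalent. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


# Assumed labels: 'different' names two distinct conditions, 'same' is a paraphrase.
different = ("sospecha de laringitis alérgica", "sospecha de faringitis alérgica")
same = ("sospecha de laringitis alérgica", "probable laringitis alérgica")

for label, (d1, d2) in (("different", different), ("same", same)):
    print(label, levenshtein(d1, d2), round(jaccard(d1, d2), 2))
```

Under these assumptions the pair with different meanings is closer on both measures than the equivalent pair, so no single similarity threshold separates them.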
Soft-TF-IDF Information retrieval approach (baseline)
Build n-gram inverted index (tolerant retrieval)
Character-bigram vectors with TF-IDF weighting scheme (ltc.nnc)
Classification rule: the top result is a match if its score ≥ t, for a threshold t
Validation: corpus + queries (partition)
Metrics: precision(t), recall(t), F1(t)
Precision-recall trade-off controlled by t:
precision(t) increasing
recall(t) decreasing
Evaluation (results): F1 = 0.74
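A minimal sketch of how precision(t), recall(t) and F1(t) could be computed for the top-result rule above. It assumes every query has a correct concept in the corpus, so recall is taken over all queries; the scores in `toy_results` are invented for illustration and are not from the evaluation.

```python
def metrics_at(t, results):
    """results: one (top_score, top_is_correct) tuple per query.

    A query counts as answered when its top score is >= t.
    """
    accepted = [ok for score, ok in results if score >= t]
    true_pos = sum(accepted)
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / len(results)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical top-1 results: (score, whether the top hit was the right concept).
toy_results = [(0.95, True), (0.90, True), (0.80, False), (0.71, True), (0.44, False)]

for t in (0.40, 0.85):
    p, r, f1 = metrics_at(t, toy_results)
    print(f"t={t:.2f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On this toy input, raising t from 0.40 to 0.85 trades recall for precision, which is the trade-off the slide describes.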
Query: sosp faringitis alergica
Results:
description | score
sosp laringitis alérgica | 0.95
sospecha faringitis alérgica | 0.71
Outcome: false positive
Query: antec fliar de ca pulmonar padre biolog
Results (not found):
description | score
antecedente familiar de neoplasia maligna de pulmón en padre natural | 0.44
Outcome: false negative
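The baseline scoring can be sketched as follows. This is an approximation under stated assumptions, not the authors' implementation: ltc.nnc is read in the usual SMART document.query order (log TF, IDF and cosine normalization for documents; raw TF and cosine normalization for queries), strings are padded with one space on each side, and the three-description corpus is a toy.

```python
import math
from collections import Counter


def bigrams(text):
    """Character-bigram counts of the space-padded text."""
    padded = f" {text} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def normalize(vec):
    """Scale a sparse vector to unit length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(corpus):
    """ltc document vectors: (1 + log tf) * log(N / df), cosine-normalized."""
    tfs = [bigrams(d) for d in corpus]
    df = Counter(g for tf in tfs for g in tf)
    n = len(corpus)
    return [normalize({g: (1 + math.log(c)) * math.log(n / df[g])
                       for g, c in tf.items()}) for tf in tfs]


def search(query, corpus, docs):
    """nnc query vector (raw tf, cosine-normalized); returns (score, text), best first."""
    q = normalize(dict(bigrams(query)))
    scored = [(sum(w * d.get(g, 0.0) for g, w in q.items()), text)
              for text, d in zip(corpus, docs)]
    return sorted(scored, reverse=True)


toy_corpus = ["sosp laringitis alérgica",
              "sospecha faringitis alérgica",
              "fractura de cadera"]
ranked = search("sosp faringitis alergica", toy_corpus, build_index(toy_corpus))
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Even on this toy corpus the laringitis description ranks above the faringitis one, the same kind of false positive shown in the first query above. The absolute scores differ from the slide because the corpus and preprocessing are not the real ones.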
Machine learning approach
Logic/rule-based approach (knowledge engineering perspective)
Difficult to encode system semantics, noise, ambiguity and errors; does not scale up
Machine learning approach
Learn to match clinical term descriptions from current knowledge of valid and invalid matchings. Steps:
Dataset construction
Feature generation
Training (MaxEnt, XGBoost *)
Evaluation
XGBoost *
Gradient boosting
Ensemble of trees (weak learners)
Additive training: iteratively add the tree that most improves the model
Regularization: tree complexity, shrinkage, stochastic gradient boosting (bagging)
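For reference, the additive training and regularization bullets above correspond to the standard gradient tree boosting formulation used by XGBoost, written here as a sketch. The symbols are the conventional ones and do not appear on the slide: l is the loss, f_t the tree added at round t, η the shrinkage factor, T the number of leaves of a tree, w_j its leaf weights, and γ, λ the complexity penalties.

```latex
\begin{align}
  \hat{y}_i^{(t)} &= \hat{y}_i^{(t-1)} + \eta\, f_t(x_i) \\
  \mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \\
  \Omega(f) &= \gamma\, T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
\end{align}
```

At each round the tree f_t is chosen to minimize Obj^(t). The Ω term penalizes tree complexity, η implements shrinkage, and the stochastic variant fits each tree on a random subsample of the training rows.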
Dataset construction
For each pair of validated descriptions {d1, d2} in the corpus, create:
a positive example if they belong to the same concept (target = 1)
a negative example if d2 is a false positive result when query is d1 (target = 0)
Corpus (example)
concept | description
sospecha de faringitis | sospecha de faringitis
 | sosp faringitis
sospecha de laringitis | sospecha de laringitis
 | sosp laringitis
 | sos laringitis
Dataset
d1 | d2 | target
sosp laringitis | sospecha laringitis | 1
sos faringitis | sosp faringitis | 1
sospecha de faringitis | sospecha de laringitis | 0
sosp de faringitis | sosp laringitis | 0
sos de faringitis | sosp laringitis | 0
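The construction rule above can be sketched as below. Positive pairs come from descriptions of the same concept; negative pairs come from retrieval results that belong to a different concept. The retriever is passed in as a function. The stand-in used here (`same_first_word`) is purely hypothetical and much cruder than the TF-IDF baseline; it only serves to make the example runnable.

```python
from itertools import combinations


def build_dataset(corpus, retrieve):
    """corpus: dict mapping concept -> list of validated descriptions.
    retrieve: function(query) -> list of candidate descriptions.
    Returns (d1, d2, target) rows.
    """
    concept_of = {d: c for c, descs in corpus.items() for d in descs}
    rows = []
    # Positive examples: two descriptions of the same concept.
    for descs in corpus.values():
        rows.extend((d1, d2, 1) for d1, d2 in combinations(descs, 2))
    # Negative examples: retrieved candidates that belong to another concept.
    for d1, c1 in concept_of.items():
        for d2 in retrieve(d1):
            if d2 in concept_of and concept_of[d2] != c1:
                rows.append((d1, d2, 0))
    return rows


toy_corpus = {
    "sospecha de faringitis": ["sospecha de faringitis", "sosp faringitis"],
    "sospecha de laringitis": ["sospecha de laringitis", "sosp laringitis",
                               "sos laringitis"],
}
all_descriptions = [d for descs in toy_corpus.values() for d in descs]


def same_first_word(query):
    """Hypothetical stand-in retriever: other descriptions sharing the first word."""
    head = query.split()[0]
    return [d for d in all_descriptions if d != query and d.split()[0] == head]


rows = build_dataset(toy_corpus, same_first_word)
for row in rows:
    print(row)
```

Using false positives of the retriever as negatives, rather than random pairs, keeps the training set focused on the confusable cases the classifier has to resolve.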
Dataset construction
Real data looks like...
d1 | d2 | target
antec fliar de madre con hipertiroidismo | AF de madre con hipertiridismo | 1
antec fliar de madre con hipertiroidismo | AF de madre con hipotiroidismo | 0
madre con hipotiroidismo | atc fam madre hipotirosidismo | 1
ant fam de padre con diabetes | antec familiar padre con diabetes | 1
ant familiar de padre con diabetes | antec familiar madre con diabetes | 0
antecedente fam de padre cáncer renal | AF de padre cariñón | 1
ca de piel hace 3 meses | neoplasia maligna de piel | 1
abandono madre biológica febrero 2002 | fuga del hogar de la madre natural | 1
muerte por asfixia en incendio forestal | fallec por asfixia en un incendio | 1
fractura de cadera por un accidente | fractura de cadera debido a accidente | 1
Features
S1: L1 = length(d1); L2 = length(d2); m, M = min(L1, L2), max(L1, L2); M − m; m/M; Levenshtein ratio(d1, d2); Jaccard(d1, d2)
S2: binary vectors over (w, d1) and (w, d2)
S3: TF-IDF vectors over (w, d1) and (w, d2)
S4: TF vectors over (b, d1) and (b, d2)
S5: TF-IDF vectors over (b, d1) and (b, d2)
S6: binary vectors over (w, d12) and (w, d21)
S7: TF-IDF vectors over (b, d12) and (b, d21)
S8: TF vectors over (w, d12), (w, d21) and (w, c)
S9: TF vectors over (b, d12), (b, d21) and (b, c)
S10: word groups over (w, d12), (w, d21) and (w, c)
Notation
w: unigram word
b: bigram character
d12 = w(d1) \ w(d2)
d21 = w(d2) \ w(d1)
c = w(d1) ∩ w(d2)
Example (unigram word)
d1 = fractura de rodilla izquierda
d2 = fractura de rodilla izq
w(d1) = {fractura, de, rodilla, izquierda}
w(d2) = {fractura, de, rodilla, izq}
d12 = {izquierda}
d21 = {izq}
c = {fractura, de, rodilla}
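The difference and intersection sets in the example above, together with part of feature set S1, can be computed as in this sketch. Two assumptions are made that the slide does not settle: lengths are measured in characters, and the Levenshtein ratio is left out because its exact definition is not given. Function names are illustrative.

```python
def words(description):
    """Unigram word set w(d)."""
    return set(description.split())


def char_bigrams(description):
    """Character bigram set b(d)."""
    return {description[i:i + 2] for i in range(len(description) - 1)}


def split_pair(d1, d2, tokenize=words):
    """Return (d12, d21, c): tokens only in d1, only in d2, and in both."""
    t1, t2 = tokenize(d1), tokenize(d2)
    return t1 - t2, t2 - t1, t1 & t2


def s1_features(d1, d2):
    """Length-based part of S1 plus word-level Jaccard (lengths in characters)."""
    l1, l2 = len(d1), len(d2)
    m, big_m = min(l1, l2), max(l1, l2)
    w1, w2 = words(d1), words(d2)
    union = w1 | w2
    return {
        "L1": l1,
        "L2": l2,
        "M-m": big_m - m,
        "m/M": m / big_m if big_m else 1.0,
        "jaccard": len(w1 & w2) / len(union) if union else 1.0,
    }


d1 = "fractura de rodilla izquierda"
d2 = "fractura de rodilla izq"
d12, d21, c = split_pair(d1, d2)
print(d12, d21, c)
print(s1_features(d1, d2))
```

Passing `tokenize=char_bigrams` to `split_pair` gives the bigram variants used by S7 and S9. For this pair the bigram d21 is empty, since `d2` is a prefix of `d1`.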
Word groups (main idea)
Given positive examples in dataset
d1 | d2
antecedente familiar de madre con hipertiroidismo | AF de madre con hipertiroidismo
antecedente familiar de padre cáncer de hígado | AF de padre ca de higado
ca de piel | neoplasia maligna de piel
abandono de madre biológica | abandono de madre natural
muerte por asfixia | fallecimiento por asfixia
fractura de cadera a causa de accidente | fractura de cadera debido a accidente
Infer semantic equivalence classes
{{deceso, fallecimiento, muerte}, {biológico, natural}, {debido a, a causa de}, {cáncer, ca, neoplasia maligna}, {renal, de riñón}}
Then
The discovered knowledge allows the following 72 descriptions (among others) to be recognized as semantically equivalent:
Duelo por {deceso | fallecimiento | muerte} de padre {biológico | natural} {debido a | a causa de} {cáncer | ca | neoplasia maligna} {renal | de riñón}
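The count of 72 follows from multiplying the sizes of the five equivalence classes (3 × 2 × 2 × 3 × 2). A short sketch that enumerates the variants, with the fixed fragments "Duelo por" and "de padre" treated as single-option slots:

```python
from itertools import product

# Each slot lists the interchangeable wordings taken from the classes above.
slots = [
    ["Duelo por"],
    ["deceso", "fallecimiento", "muerte"],
    ["de padre"],
    ["biológico", "natural"],
    ["debido a", "a causa de"],
    ["cáncer", "ca", "neoplasia maligna"],
    ["renal", "de riñón"],
]

descriptions = [" ".join(choice) for choice in product(*slots)]
print(len(descriptions))
print(descriptions[0])
```

Mapping every word to its class identifier before comparing two descriptions collapses all of these variants to the same representation, which is what the word-group features (S10) exploit.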
Word groups
d1 | d2 | target
sospecha de dengue | probable dengue | 1
sospecha de ACV | posible ACV | 1
sosp tumor renal | probable tumor renal | 1
... | ... | ...
Semantic equivalence pairs:
C = {(sospecha, probable), (sospecha, posible), (sosp, probable)}
Semantic equivalence inference procedure
Build an undirected weighted graph G = (V, E, W), where E = {({d12, d21}, w) : (d12, d21) ∈ C, w = frequency(d12, d21)}
Remove edges in G if w < t for some threshold t
Find connected components in G
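The three steps of the inference procedure can be sketched in plain Python as follows. The pair frequencies in `toy_pairs` are invented for illustration; only the pairs themselves echo the examples above.

```python
from collections import Counter, defaultdict


def equivalence_classes(pairs, t):
    """pairs: iterable of (d12, d21) observed in positive examples.
    Builds the undirected weighted graph, drops edges with weight < t,
    and returns the connected components as sets of terms.
    """
    # Edge weight = how often the unordered pair was observed.
    freq = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    adj = defaultdict(set)
    for edge, w in freq.items():
        if w >= t:
            a, b = tuple(edge)
            adj[a].add(b)
            adj[b].add(a)
    adj = dict(adj)
    # Depth-first search to collect connected components.
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical observation counts.
toy_pairs = ([("sospecha", "probable")] * 3 + [("sospecha", "posible")] * 2
             + [("sosp", "probable")] * 2 + [("izq", "izquierda")] * 2
             + [("ca", "cáncer")])
classes = equivalence_classes(toy_pairs, t=2)
print(classes)
```

The threshold acts as a noise filter: a pair seen only once (here ca, cáncer) does not create an edge, so a single mislabeled example cannot merge two classes.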
What happens with ambiguous concepts?
Word groups
Ambiguous cases
Ambiguous words are connected to multiple concepts, e.g. the acronym od is connected to oido derecho (right ear), ojo derecho (right eye), ovario derecho (right ovary)
Multiple concepts are in the same connected component
Mitigation
Label propagation algorithm (community detection in complex networks)
No parameter is required to be known beforehand (e.g. number of clusters)
For such connected components we run the label propagation algorithm (community detection)
Main idea: each vertex v ∈ G, connected to v1, ..., vk where each vi has a label L(vi), joins the community to which the maximum number of its neighbors belong (ties broken uniformly at random)
Clustering quality is evaluated using the modularity measure
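A minimal sketch of asynchronous label propagation on an unweighted adjacency map. It is one common variant, not necessarily the one used in the experiments: a vertex keeps its current label when that label is already among the most frequent ones in its neighborhood, otherwise it picks uniformly at random among them. The `od` graph and its abbreviated neighbors are hypothetical.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """adj: dict mapping vertex -> set of neighbor vertices.
    Returns a dict vertex -> community label.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every vertex starts in its own community
    order = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(order)                # random visiting order per sweep
        changed = False
        for v in order:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = sorted(l for l, n in counts.items() if n == best)
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)   # ties broken at random
                changed = True
        if not changed:                   # every vertex holds a majority label
            break
    return labels


# Hypothetical ambiguous component around the acronym "od".
edges = [("od", "oido derecho"), ("od", "ojo derecho"), ("od", "ovario derecho"),
         ("oido derecho", "oido der"), ("ojo derecho", "ojo der"),
         ("ovario derecho", "ovario der")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)
print(label_propagation(graph))
```

Because of the random tie-breaking, different seeds can yield different partitions of such a component, which is one reason to score the resulting clustering with modularity.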
Results
Detected: 1,496 semantic equivalence classes with 9,152 words
Graph: 9,926 items (words), threshold t = 3
Connected components: 3,289 items (1,004 unambiguous components) + 6,637 items (ambiguous components)
Label propagation (clustering), 1st execution (6,637 items): 5,831 items (487 unambiguous clusters) + 806 items (ambiguous clusters)
Label propagation (clustering), 2nd execution (806 items): 32 items (5 unambiguous clusters) + 774 items (1 ambiguous cluster)
Examples
aumento: aumento, elevacion, alza, ascenso, incremento
boca: boca, bucal, bucales, oral, orales, yugal
conyugue: conyugue, conyuge, esposa, esposo, marido, pareja, novia, matrimonial, maritale
cutaneo: cutaneo, cutanea, dermatologica, dermica, dermico, piel, peil
fractura: fractura, fx, fc, fratura, fracura, fract, fr
fumador: fumador, fumadora, tabaco, tabaquismo, tabaquista, tqb
infantil: infantil, pediatrico, pedriatica
izquierda: izquierda, izq, izquierdo, izquierda, izqdo, izqda, izda
paciente: paciente, pac, pact, pacte, pte, pcte
postoperatorio: postoperatorio, postquirurgico, postqx, posqx, postop, posop, pop
quimioterapia: quimioterapia, qmt, qt, quimio, pqt
sindrome: sindrome, sme, enfermedad, sd, sind, enf, cuadro, sind, sindorme, sdme, sdr, sde
traumatismo: traumatismo, trauma, tx, trauamtismo, trumatismo, trauma, golpe, tmo
Classification experiments
Unigram word features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S2) d1, d2 | binary | 0.59 | 0.59
(S3) d1, d2 | tf-idf | 0.59 | 0.58
(S6) d12, d21 | binary | 0.63 | 0.62
(S8) d12, d21, c | binary | 0.76 | 0.62
Bigram character features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S4) d1, d2 | freq. | 0.57 | 0.76
(S5) d1, d2 | tf-idf | 0.56 | 0.74
(S7) d12, d21 | freq. | 0.58 | 0.76
(S9) d12, d21, c | freq. | 0.72 | 0.77
Classification experiments
Top-1 result
Model | Prec | Rec | F1
IR | 0.73 | 0.76 | 0.74
MaxEnt (S1) | 0.66 | 0.74 | 0.70
XGBoost (S1) | 0.65 | 0.70 | 0.68
MaxEnt (S8) | 0.74 | 0.78 | 0.76
XGBoost (S9) | 0.75 | 0.79 | 0.77
MaxEnt (S1, S8, S10) | 0.87 | 0.91 | 0.89
XGBoost (S1, S9, S10) | 0.87 | 0.91 | 0.89
Conclusions & Future work
The approach outperforms the Soft-TF-IDF baseline
Requires neither lexical knowledge (acronyms, abbreviations, synonyms) nor spell checkers; this knowledge is acquired from examples
Unsupervised learning of synonyms, abbreviations and typos improves the results obtained through string similarity features
No Spanish-specific resource is used, so our approach can be replicated in any language
Possible to use query expansion techniques

A machine learning approach to clinical terms normalization

  • 1. A Machine Learning Approach to Clinical Terms Normalization H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti Depto. de Inform´atica en Salud, Hospital Italiano de Buenos Aires hernan.berinsky@hospitalitaliano.org.ar Depto. de Computaci´on, FCEyN, Universidad de Buenos Aires jcastano@dc.uba.ar August 12, 2016 H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 1 / 15
  • 2. Context Terminology Services SNOMED-CT as reference terminology HIBA terminology Interface vocabulary H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 2 / 15
  • 3. Interface vocabulary Objective Semantic recognition of clinical term descriptions Problems (domain) Clinical findings, family history, suspected disease Lexical variability and noise Descriptions contain acronyms, abbreviations, typos, irrelevant data Difficult to develop a rule-based approach due to ’long-tail’ nature of the problem String matching Drawbacks with approximate string matching (fuzzy string matching) e.g. Levenshtein or Jaccard in clinical domain. sospecha de laringitis al´ergica sospecha de faringitis al´ergica sospecha de laringitis al´ergica probable laringitis al´ergica antec fliar de madre con hipotiroidismo antec fliar de padre con hipertiroidismo antecedente familiar de madre con hipotiroidismo madre con hipertiroidismo embarazo 7 semanas embarazo 20 semanas fractura de cadera ayer al mediod´ıa fractura de cadera hace 2 semanas H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 3 / 15
  • 4. Soft-TF-IDF Information retrieval approach (baseline) Build n-gram inverted index (tolerant retrieval) vector of bigram character TF-IDF weighting schema (ltc.nnc) Classification rule: a match if score ≥ t, t a threshold in top result Validation: corpus + queries (partition) Metrics: precision(t), recall(t), F1(t) Precision-recall trade-off controlled by t: precision(t) increasing recall(t) decreasing Evaluation (results): F1 = 0.74 Query sosp faringitis alergica Results description score sosp laringitis al´ergica 0.95 sospecha faringitis al´ergica 0.71 False positive Query antec fliar de ca pulmonar padre biolog Results (not found) description score antecedente familiar de neoplasia 0.44 maligna de pulm´on en padre natural False negative H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 4 / 15
  • 5. Machine learning approach Logic/rule-based approach (knowledge engineering perspective) Difficult to encode system semantics, noise, ambiguity and errors, does not scale up Machine learning approach Learn to match clinical term descriptions based on current knowledge valid/invalid matchings. Steps: Dataset construction Features generation Training (MaxEnt, XGBoost *) Evaluation XGBoost * Gradient boosting Ensemble of trees (weak learners) Additive training, iteratively add tree that most improve the model Regularization: tree complexity, shrinkage, stochastic gradient boosting (bagging) H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 5 / 15
  • 6. Dataset construction For each pair of validated descriptions {d1, d2} in the corpus, create: a positive example if they belong to the same concept (target = 1) a negative example if d2 is a false positive result when query is d1 (target = 0) Corpus (example) concept description sospecha de faringitis sospecha de faringitis sosp faringitis sospecha de laringitis sospecha de laringitis sosp laringitis sos laringitis Dataset d1 d2 target sosp laringitis sospecha laringitis 1 sos faringitis sosp faringitis 1 sospecha de faringitis sospecha de laringitis 0 sosp de faringitis sosp laringitis 0 sos de faringitis sosp laringitis 0 H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 6 / 15
  • 7. Dataset construction Real data looks like... d1 d2 target antec fliar de madre con hipertiroidismo AF de madre con hipertiridismo 1 antec fliar de madre con hipertiroidismo AF de madre con hipotiroidismo 0 madre con hipotiroidismo atc fam madre hipotirosidismo 1 ant fam de padre con diabetes antec familiar padre con diabetes 1 ant familiar de padre con diabetes antec familiar madre con diabetes 0 antecedente fam de padre c´ancer renal AF de padre cari˜n´on 1 ca de piel hace 3 meses neoplasia maligna de piel 1 abandono madre biol´ogica febrero 2002 fuga del hogar de la madre natural 1 muerte por asfixia en incendio forestal fallec por asfixia en un incendio 1 fractura de cadera por un accidente fractura de cadera debido a accidente 1 H. Berinsky, J. Casta˜no, M. Gambarte, D. Perez, H. Park, M. ´Avila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA)Hospital Italiano de Buenos Aires August 12, 2016 7 / 15
  • 8. Features
      S1     L1 = length(d1); L2 = length(d2); m, M = min(L1, L2), max(L1, L2); M − m; m/M;
             Levenshtein ratio(d1, d2); Jaccard(d1, d2)
      S2     vector of binary (w, d1); vector of binary (w, d2)
      S3     vector of TF-IDF (w, d1); vector of TF-IDF (w, d2)
      S4     vector of TF (b, d1); vector of TF (b, d2)
      S5     vector of TF-IDF (b, d1); vector of TF-IDF (b, d2)
      S6     vector of binary (w, d12); vector of binary (w, d21)
      S7     vector of TF-IDF (b, d12); vector of TF-IDF (b, d21)
      S8     vector of TF (w, d12); vector of TF (w, d21); vector of TF (w, c)
      S9     vector of TF (b, d12); vector of TF (b, d21); vector of TF (b, c)
      S10    Word groups (w, d12); Word groups (w, d21); Word groups (w, c)
      Notation
        w: unigram word
        b: bigram character
        d12 = words(d1) \ words(d2)
        d21 = words(d2) \ words(d1)
        c = w(d1) ∩ w(d2)
      Example (unigram word)
        d1 = fractura de rodilla izquierda
        d2 = fractura de rodilla izq
        w(d1) = {fractura, de, rodilla, izquierda}
        w(d2) = {fractura, de, rodilla, izq}
        d12 = {izquierda}
        d21 = {izq}
        c = {fractura, de, rodilla}
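A minimal sketch of the S1 string features and of the d12, d21 and c word sets follows. Several details are assumptions, since the slide does not specify them: lengths are counted in characters, Jaccard is computed over word unigrams, and the Levenshtein ratio is taken as 1 − distance / max(L1, L2) (libraries define this ratio in slightly different ways). The function names are illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def s1_features(d1, d2):
    """Length and string-similarity features (assumed character lengths)."""
    L1, L2 = len(d1), len(d2)
    m, M = min(L1, L2), max(L1, L2)
    w1, w2 = set(d1.split()), set(d2.split())
    union = w1 | w2
    return {
        "L1": L1, "L2": L2, "m": m, "M": M,
        "M-m": M - m,
        "m/M": m / M if M else 1.0,
        # Assumed normalisation: 1 - distance / longer length.
        "lev_ratio": 1 - levenshtein(d1, d2) / M if M else 1.0,
        # Assumed to be Jaccard over word unigrams.
        "jaccard": len(w1 & w2) / len(union) if union else 1.0,
    }

def word_sets(d1, d2):
    """Return (d12, d21, c): words only in d1, words only in d2, shared words."""
    w1, w2 = set(d1.split()), set(d2.split())
    return w1 - w2, w2 - w1, w1 & w2

d12, d21, c = word_sets("fractura de rodilla izquierda", "fractura de rodilla izq")
features = s1_features("fractura de rodilla izquierda", "fractura de rodilla izq")
```

The difference sets isolate exactly the tokens that distinguish the two descriptions, which is what the S6 to S10 feature groups are built on.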
  • 9. Word groups (main idea)
      Given positive examples in the dataset
        d1                                                   d2
        antecedente familiar de madre con hipertiroidismo    AF de madre con hipertiroidismo
        antecedente familiar de padre cáncer de hígado       AF de padre ca de higado
        ca de piel                                           neoplasia maligna de piel
        abandono de madre biológica                          abandono de madre natural
        muerte por asfixia                                   fallecimiento por asfixia
        fractura de cadera a causa de accidente              fractura de cadera debido a accidente
      Infer semantic equivalence classes
        {{deceso, fallecimiento, muerte}, {biológico, natural}, {debido a, a causa de}, {cáncer, ca, neoplasia maligna}, {renal, de riñón}}
      Then
        The discovered knowledge allows the following 72 descriptions (among others) to be recognized as semantically equivalent:
        Duelo por {deceso | fallecimiento | muerte} de padre {biológico | natural} {debido a | a causa de} {cáncer | ca | neoplasia maligna} {renal | de riñón}
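The count of 72 is the product of the class sizes in the template, 3 × 2 × 2 × 3 × 2. A short enumeration makes this concrete; the variable names are illustrative only.

```python
from itertools import product

# One slot per equivalence class used in the template on this slide.
slots = [
    ["deceso", "fallecimiento", "muerte"],
    ["biológico", "natural"],
    ["debido a", "a causa de"],
    ["cáncer", "ca", "neoplasia maligna"],
    ["renal", "de riñón"],
]

# Every combination of one member per slot yields one surface description.
variants = [
    f"Duelo por {death} de padre {kin} {cause} {cancer} {site}"
    for death, kin, cause, cancer, site in product(*slots)
]

print(len(variants))  # 3 * 2 * 2 * 3 * 2 = 72
```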
  • 10. Word groups
        d1                    d2                      target
        sospecha de dengue    probable dengue         1
        sospecha de ACV       posible ACV             1
        sosp tumor renal      probable tumor renal    1
        ...                   ...                     ...
      Semantic equivalence pairs: C = {(sospecha, probable), (sospecha, posible), (sosp, probable)}
      Semantic equivalence inference procedure
        Build an undirected weighted graph G = (V, E, W) where E = {({d12, d21}, w) : (d12, d21) ∈ C, w = frequency(d12, d21)}
        Remove edges in G if w < t for some threshold t
        Find connected components in G
      What happens with ambiguous concepts?
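The inference procedure (count pair frequencies, drop edges below the threshold, take connected components) can be sketched with the standard library alone. The function name `equivalence_classes` and the representation of each vertex as a plain string are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def equivalence_classes(pairs, t=1):
    """pairs: iterable of (d12, d21) items observed in positive examples.
    Edges seen fewer than t times are dropped before finding components."""
    # Edge weight = how often the unordered pair was observed.
    freq = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    adj = defaultdict(set)
    for edge, w in freq.items():
        if w >= t:  # keep only edges at or above the threshold
            a, b = tuple(edge)
            adj[a].add(b)
            adj[b].add(a)
    # Connected components by iterative depth-first search.
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return components

C = [("sospecha", "probable"), ("sospecha", "posible"), ("sosp", "probable")]
classes = equivalence_classes(C, t=1)
```

With t = 1 the three pairs above collapse into a single class; raising t discards pairs that were observed too rarely to be trusted.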
  • 11. Word groups
      Ambiguous cases
        Ambiguous words are connected to multiple concepts, e.g. the acronym od is connected to oido derecho (right ear), ojo derecho (right eye) and ovario derecho (right ovary)
        Multiple concepts end up in the same connected component
      Mitigation
        Label propagation algorithm (community detection in complex networks)
        No parameter needs to be known beforehand (e.g. the number of clusters)
        For such connected components we run the label propagation algorithm
        Main idea: if a vertex v is connected to v1, ..., vk, where each vi has a label L(vi), then v joins the community to which the largest number of its neighbors belong (ties are broken uniformly at random)
        Clustering is evaluated using the modularity measure
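The label propagation idea can be sketched as follows. This is a generic asynchronous variant written for illustration, not necessarily the exact implementation used by the authors; the function names, the fixed random seed, and the sweep limit are assumptions, and the modularity evaluation is not shown.

```python
import random
from collections import Counter

def _is_stable(v, adj, labels):
    """True if v already carries one of the most frequent labels among its neighbors."""
    if not adj[v]:
        return True
    counts = Counter(labels[u] for u in adj[v])
    return counts[labels[v]] == max(counts.values())

def label_propagation(adj, seed=0, max_sweeps=100):
    """adj: dict mapping vertex -> set of neighbor vertices (undirected graph).
    Returns a dict mapping vertex -> community label."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every vertex starts in its own community
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit vertices in random order
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            # Ties between equally frequent labels are broken uniformly at random.
            labels[v] = rng.choice(sorted(l for l, n in counts.items() if n == best))
        if all(_is_stable(v, adj, labels) for v in adj):
            break
    return labels
```

Because ties are broken at random, different seeds can yield different partitions of the same component, which is one reason to score the resulting clustering with a measure such as modularity.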
  • 12. Results
      Detected: 1,496 semantic equivalence classes with 9,152 words
      Graph: 9,926 items (words) (threshold t = 3)
      Connected components: 3,289 items (1,004 unambiguous components) + 6,637 items (ambiguous components)
      Label propagation (clustering), 1st execution (6,637 items): 5,831 items (487 unambiguous clusters) + 806 items (ambiguous clusters)
      Label propagation (clustering), 2nd execution (806 items): 32 items (5 unambiguous clusters) + 774 items (1 ambiguous cluster)
      Examples
        aumento           aumento, elevacion, alza, ascenso, incremento
        boca              boca, bucal, bucales, oral, orales, yugal
        conyugue          conyugue, conyuge, esposa, esposo, marido, pareja, novia, matrimonial, maritale
        cutaneo           cutaneo, cutanea, dermatologica, dermica, dermico, piel, peil
        fractura          fractura, fx, fc, fratura, fracura, fract, fr
        fumador           fumador, fumadora, tabaco, tabaquismo, tabaquista, tqb
        infantil          infantil, pediatrico, pedriatica
        izquierda         izquierda, izq, izquierdo, izquierda, izqdo, izqda, izda
        paciente          paciente, pac, pact, pacte, pte, pcte
        postoperatorio    postoperatorio, postquirurgico, postqx, posqx, postop, posop, pop
        quimioterapia     quimioterapia, qmt, qt, quimio, pqt
        sindrome          sindrome, sme, enfermedad, sd, sind, enf, cuadro, sind, sindorme, sdme, sdr, sde
        traumatismo       traumatismo, trauma, tx, trauamtismo, trumatismo, trauma, golpe, tmo
  • 13. Classification experiments
      Unigram word features
        Features             Weight    MaxEnt (F1)    XGBoost (F1)
        (S2) d1, d2          binary    0.59           0.59
        (S3) d1, d2          tf-idf    0.59           0.58
        (S6) d12, d21        binary    0.63           0.62
        (S8) d12, d21, c     binary    0.76           0.62
      Bigram character features
        Features             Weight    MaxEnt (F1)    XGBoost (F1)
        (S4) d1, d2          freq.     0.57           0.76
        (S5) d1, d2          tf-idf    0.56           0.74
        (S7) d12, d21        freq.     0.58           0.76
        (S9) d12, d21, c     freq.     0.72           0.77
  • 14. Classification experiments
      Top-1 result
        Model                    Prec    Rec     F1
        IR                       0.73    0.76    0.74
        MaxEnt (S1)              0.66    0.74    0.70
        XGBoost (S1)             0.65    0.70    0.68
        MaxEnt (S8)              0.74    0.78    0.76
        XGBoost (S9)             0.75    0.79    0.77
        MaxEnt (S1, S8, S10)     0.87    0.91    0.89
        XGBoost (S1, S9, S10)    0.87    0.91    0.89
  • 15. Conclusions & Future work
      Outperforms Soft-TF-IDF (baseline)
      Does not require lexical knowledge (acronyms, abbreviations, synonyms) or spell checkers; this knowledge is acquired from examples
      Unsupervised learning of synonyms, abbreviations and typos improves on the results obtained through string similarity features
      No Spanish-specific resource is used, so the approach can be replicated in any language
      Possible to use query expansion techniques