A machine learning approach to clinical terms normalization
1. A Machine Learning Approach to Clinical Terms Normalization
H. Berinsky, J. Castaño, M. Gambarte, D. Perez, H. Park, M. Ávila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti
Depto. de Informática en Salud, Hospital Italiano de Buenos Aires
hernan.berinsky@hospitalitaliano.org.ar
Depto. de Computación, FCEyN, Universidad de Buenos Aires
jcastano@dc.uba.ar
August 12, 2016
H. Berinsky, J. Castaño, M. Gambarte, D. Perez, H. Park, M. Ávila Williams, F. Campos, D. Luna, S. Benitez, S. Zanetti (HIBA), Hospital Italiano de Buenos Aires, August 12, 2016, 1 / 15
2. Context
Terminology Services
SNOMED-CT as reference terminology
HIBA terminology
Interface vocabulary
3. Interface vocabulary
Objective
Semantic recognition of clinical term descriptions
Problems (domain)
Clinical findings, family history, suspected disease
Lexical variability and noise
Descriptions contain acronyms, abbreviations, typos, irrelevant data
Difficult to develop a rule-based approach due to the 'long-tail' nature of the problem
String matching
Drawbacks of approximate (fuzzy) string matching, e.g. Levenshtein or Jaccard, in the clinical domain. Lexically close pairs can differ in meaning, and lexically distant pairs can mean the same:
sospecha de laringitis alérgica | sospecha de faringitis alérgica
sospecha de laringitis alérgica | probable laringitis alérgica
antec fliar de madre con hipotiroidismo | antec fliar de padre con hipertiroidismo
antecedente familiar de madre con hipotiroidismo | madre con hipertiroidismo
embarazo 7 semanas | embarazo 20 semanas
fractura de cadera ayer al mediodía | fractura de cadera hace 2 semanas
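The drawback can be illustrated with a minimal sketch that scores the first two pairs above. The normalization of the edit distance (one minus distance over the longer length) and the word-level Jaccard are assumptions made for illustration; the slides do not state which variants were used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_words(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two descriptions."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


# Different concepts (laringitis vs faringitis), one character apart.
different = ("sospecha de laringitis alérgica", "sospecha de faringitis alérgica")
# Same clinical meaning, different wording.
same = ("sospecha de laringitis alérgica", "probable laringitis alérgica")

for name, (d1, d2) in {"different": different, "same": same}.items():
    print(name, round(levenshtein_similarity(d1, d2), 3), round(jaccard_words(d1, d2), 3))
```

Both measures rank the pair with different meanings as more similar than the pair with the same meaning, which is the failure mode the slide points at.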
4. Soft-TF-IDF Information retrieval approach (baseline)
Build n-gram inverted index (tolerant retrieval)
Vectors of character-bigram TF-IDF weights (ltc.nnc weighting scheme)
Classification rule: the top result is a match if its score ≥ t, where t is a threshold
Validation: corpus + queries (partition)
Metrics: precision(t), recall(t), F1(t)
Precision-recall trade-off controlled by t:
precision(t) increases with t
recall(t) decreases with t
Evaluation (results): F1 = 0.74
Query: sosp faringitis alergica
Results
description | score
sosp laringitis alérgica | 0.95
sospecha faringitis alérgica | 0.71
False positive
Query: antec fliar de ca pulmonar padre biolog
Results (not found)
description | score
antecedente familiar de neoplasia maligna de pulmón en padre natural | 0.44
False negative
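A minimal sketch of this kind of baseline follows, assuming the standard SMART reading of ltc.nnc: indexed descriptions get log TF, IDF and cosine normalization, while queries get raw TF, no IDF and cosine normalization. The class name, the padding with spaces and the toy corpus are illustrative assumptions, and the "soft" matching of Soft-TF-IDF is not reproduced here.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Character bigrams of the text, padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def unit(vec):
    """Cosine-normalize a sparse vector stored as a dict."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


class BigramIndex:
    """Hypothetical helper: ltc-weighted description vectors over character bigrams."""

    def __init__(self, descriptions):
        self.descriptions = list(descriptions)
        counts = [Counter(char_bigrams(d)) for d in self.descriptions]
        n = len(counts)
        df = Counter(g for c in counts for g in c)  # document frequency per bigram
        self.vectors = [
            unit({g: (1 + math.log(tf)) * math.log(n / df[g]) for g, tf in c.items()})
            for c in counts
        ]

    def search(self, query):
        """Return (score, description) pairs, best first; query weighted as nnc."""
        q = unit(dict(Counter(char_bigrams(query))))
        scored = [
            (sum(w * vec.get(g, 0.0) for g, w in q.items()), d)
            for vec, d in zip(self.vectors, self.descriptions)
        ]
        return sorted(scored, reverse=True)


def top_match(index, query, t):
    """Classification rule: accept the top result only if its score reaches t."""
    score, description = index.search(query)[0]
    return description if score >= t else None


corpus = ["fractura de cadera", "embarazo 7 semanas", "sospecha de dengue"]
index = BigramIndex(corpus)
print(index.search("fractura cadera")[0])
```

Raising `t` makes `top_match` return `None` more often, which trades recall for precision as described above.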
5. Machine learning approach
Logic/rule-based approach (knowledge engineering perspective)
Difficult to encode system semantics, noise, ambiguity and errors; does not scale up
Machine learning approach
Learn to match clinical term descriptions based on current knowledge of valid/invalid matchings. Steps:
Dataset construction
Features generation
Training (MaxEnt, XGBoost *)
Evaluation
XGBoost *
Gradient boosting
Ensemble of trees (weak learners)
Additive training: iteratively add the tree that most improves the model
Regularization: tree complexity, shrinkage, stochastic gradient boosting (bagging)
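Additive training with shrinkage can be sketched in a few lines. This is plain gradient boosting with one-split regression stumps on a single feature and squared loss, written only to illustrate the idea; it is not XGBoost, which also uses second-order gradients, explicit tree-complexity penalties and row or column subsampling. The function names and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on a 1-D feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, shrinkage=0.3):
    """Start from the mean, then repeatedly add a stump fitted to the residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrinkage scales down each tree's contribution.
        preds = [p + shrinkage * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + shrinkage * sum(t(x) for t in trees)


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(0), 3), round(model(5), 3))
```

Each round only corrects a fraction of the remaining error, so smaller shrinkage values need more trees but tend to overfit less.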
6. Dataset construction
For each pair of validated descriptions {d1, d2} in the corpus, create:
a positive example if they belong to the same concept (target = 1)
a negative example if d2 is a false positive result when the query is d1 (target = 0)
Corpus (example)
concept | description
sospecha de faringitis | sospecha de faringitis
 | sosp faringitis
sospecha de laringitis | sospecha de laringitis
 | sosp laringitis
 | sos laringitis
Dataset
d1 | d2 | target
sosp laringitis | sospecha laringitis | 1
sos faringitis | sosp faringitis | 1
sospecha de faringitis | sospecha de laringitis | 0
sosp de faringitis | sosp laringitis | 0
sos de faringitis | sosp laringitis | 0
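The construction rule can be sketched as follows. The slides take negatives from false positives of the retrieval baseline; here a character-bigram Jaccard score with an arbitrary threshold stands in for that retrieval step, which is an assumption made to keep the example self-contained. The function names are hypothetical.

```python
from itertools import combinations


def char_bigram_set(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a, b):
    """Stand-in retrieval score: Jaccard overlap of character bigrams."""
    sa, sb = char_bigram_set(a), char_bigram_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_dataset(corpus, t=0.5):
    """corpus maps each concept to its list of validated descriptions."""
    concept_of = {d: c for c, descs in corpus.items() for d in descs}
    rows = []
    for d1, d2 in combinations(sorted(concept_of), 2):
        if concept_of[d1] == concept_of[d2]:
            rows.append((d1, d2, 1))  # same concept: positive example
        elif similarity(d1, d2) >= t:
            rows.append((d1, d2, 0))  # retrieved but wrong concept: negative example
    return rows


corpus = {
    "sospecha de faringitis": ["sospecha de faringitis", "sosp faringitis"],
    "sospecha de laringitis": ["sospecha de laringitis", "sosp laringitis", "sos laringitis"],
}
dataset = build_dataset(corpus)
for row in dataset:
    print(row)
```

Restricting negatives to pairs the retriever would confuse keeps the classifier focused on hard cases instead of trivially different descriptions.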
7. Dataset construction
Real data looks like...
d1 | d2 | target
antec fliar de madre con hipertiroidismo | AF de madre con hipertiridismo | 1
antec fliar de madre con hipertiroidismo | AF de madre con hipotiroidismo | 0
madre con hipotiroidismo | atc fam madre hipotirosidismo | 1
ant fam de padre con diabetes | antec familiar padre con diabetes | 1
ant familiar de padre con diabetes | antec familiar madre con diabetes | 0
antecedente fam de padre cáncer renal | AF de padre cariñón | 1
ca de piel hace 3 meses | neoplasia maligna de piel | 1
abandono madre biológica febrero 2002 | fuga del hogar de la madre natural | 1
muerte por asfixia en incendio forestal | fallec por asfixia en un incendio | 1
fractura de cadera por un accidente | fractura de cadera debido a accidente | 1
8. Features
S1: L1 = length(d1)
    L2 = length(d2)
    m, M = min(L1, L2), max(L1, L2)
    M − m
    m/M
    Levenshtein ratio(d1, d2)
    Jaccard(d1, d2)
S2: vector of binary (w, d1)
    vector of binary (w, d2)
S3: vector of TF-IDF (w, d1)
    vector of TF-IDF (w, d2)
S4: vector of TF (b, d1)
    vector of TF (b, d2)
S5: vector of TF-IDF (b, d1)
    vector of TF-IDF (b, d2)
S6: vector of binary (w, d12)
    vector of binary (w, d21)
S7: vector of TF-IDF (b, d12)
    vector of TF-IDF (b, d21)
S8: vector of TF (w, d12)
    vector of TF (w, d21)
    vector of TF (w, c)
S9: vector of TF (b, d12)
    vector of TF (b, d21)
    vector of TF (b, c)
S10: Word groups (w, d12)
     Word groups (w, d21)
     Word groups (w, c)
w: unigram word
b: bigram character
d12 = words(d1) \ words(d2)
d21 = words(d2) \ words(d1)
c = w(d1) ∩ w(d2)
Example (unigram word)
d1 = fractura de rodilla izquierda
d2 = fractura de rodilla izq
w(d1) = {fractura, de, rodilla, izquierda}
w(d2) = {fractura, de, rodilla, izq}
d12 = {izquierda}
d21 = {izq}
c = {fractura, de, rodilla}
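The word-level part of this feature set can be sketched as below for the example pair. Taking length() as the number of characters is an assumption (the slides do not say whether it counts characters or words), the Levenshtein ratio is left out for brevity, and the function name is hypothetical.

```python
def pair_features(d1, d2):
    """Length features, word-level Jaccard, and the d12 / d21 / c word sets."""
    w1, w2 = d1.split(), d2.split()
    s1, s2 = set(w1), set(w2)
    l1, l2 = len(d1), len(d2)  # assumed: length in characters
    m, M = min(l1, l2), max(l1, l2)
    union = s1 | s2
    return {
        "L1": l1,
        "L2": l2,
        "M-m": M - m,
        "m/M": m / M if M else 1.0,
        "jaccard": len(s1 & s2) / len(union) if union else 1.0,
        "d12": [w for w in w1 if w not in s2],  # words only in d1
        "d21": [w for w in w2 if w not in s1],  # words only in d2
        "c": [w for w in w1 if w in s2],        # words common to both
    }


features = pair_features("fractura de rodilla izquierda", "fractura de rodilla izq")
print(features)
```

Splitting each pair into its differing and common parts lets a classifier learn that `izquierda` and `izq` are interchangeable regardless of the words around them.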
9. Word groups (main idea)
Given positive examples in the dataset
d1 | d2
antecedente familiar de madre con hipertiroidismo | AF de madre con hipertiroidismo
antecedente familiar de padre cáncer de hígado | AF de padre ca de higado
ca de piel | neoplasia maligna de piel
abandono de madre biológica | abandono de madre natural
muerte por asfixia | fallecimiento por asfixia
fractura de cadera a causa de accidente | fractura de cadera debido a accidente
Infer semantic equivalence classes
{{deceso, fallecimiento, muerte}, {biológico, natural}, {debido a, a causa de}, {cáncer, ca, neoplasia maligna}, {renal, de riñón}}
Then
The discovered knowledge allows the following 72 descriptions (3 × 2 × 2 × 3 × 2 combinations) to be recognized as semantically equivalent (among others):
Duelo por {deceso | fallecimiento | muerte} de padre {biológico | natural} {debido a | a causa de} {cáncer | ca | neoplasia maligna} {renal | de riñón}
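The count of 72 follows from choosing one member of each equivalence class in the template: 3 × 2 × 2 × 3 × 2. A short sketch that enumerates the variants, with the slot layout read off the slide:

```python
from itertools import product

# Fixed text and equivalence classes, in the order they appear in the template.
slots = [
    ["Duelo por"],
    ["deceso", "fallecimiento", "muerte"],
    ["de padre"],
    ["biológico", "natural"],
    ["debido a", "a causa de"],
    ["cáncer", "ca", "neoplasia maligna"],
    ["renal", "de riñón"],
]

descriptions = [" ".join(parts) for parts in product(*slots)]
print(len(descriptions))
print(descriptions[0])
```

A handful of learned word groups therefore covers a combinatorial number of surface forms, which is why the approach helps with the long tail.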
10. Word groups
d1 | d2 | target
sospecha de dengue | probable dengue | 1
sospecha de ACV | posible ACV | 1
sosp tumor renal | probable tumor renal | 1
... | ... | ...
Semantic equivalence pairs:
C = {(sospecha, probable), (sospecha, posible), (sosp, probable)}
Semantic equivalence inference procedure
Build an undirected weighted graph G = (V, E, W), where E = {({d12, d21}, w) : (d12, d21) ∈ C, w = frequency(d12, d21)}
Remove edges in G if w < t for some threshold t
Find connected components in G
What happens with ambiguous concepts?
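The inference procedure can be sketched as follows: count how often each unordered pair occurs, keep edges whose weight reaches the threshold, and collect connected components with a depth-first search. The function name and the list-of-pairs input format are assumptions.

```python
from collections import Counter, defaultdict


def equivalence_classes(pairs, t=1):
    """Connected components of the pair graph after dropping edges with weight < t."""
    # Edge weight = number of times the unordered pair was observed.
    weights = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    adjacency = defaultdict(set)
    for edge, w in weights.items():
        if w >= t:
            a, b = tuple(edge)
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


pairs = [("sospecha", "probable"), ("sospecha", "posible"), ("sosp", "probable")]
print(equivalence_classes(pairs, t=1))
```

With `t=1` the three pairs from the slide collapse into a single class; a higher threshold keeps only pairs that were observed repeatedly, which filters out accidental alignments.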
11. Word groups
Ambiguous cases
Ambiguous words are connected to multiple concepts, e.g. the acronym od is connected to oído derecho (right ear), ojo derecho (right eye), ovario derecho (right ovary)
Multiple concepts are in the same connected component
Mitigation
Label propagation algorithm (community detection in complex networks)
No parameter is required to be known beforehand (e.g. number of clusters)
For such connected components we run the label propagation algorithm (community detection)
Main idea: if a vertex v is connected to v1, ..., vk, where each vi has a label L(vi), each vertex v ∈ G chooses to join the community to which the maximum number of its neighbors belong (ties broken uniformly at random)
Clustering is evaluated using the modularity measure
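A minimal sketch of asynchronous label propagation on an unweighted graph is shown below. Keeping a vertex's current label when it is already among the most frequent neighbor labels is a common convergence device and an assumption here, as are the function name, the seed and the round limit. The modularity evaluation is not included.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_rounds=100):
    """adjacency maps each vertex to the set of its neighbors (undirected graph)."""
    rng = random.Random(seed)
    labels = {v: v for v in adjacency}  # every vertex starts in its own community
    nodes = list(adjacency)
    for _ in range(max_rounds):
        rng.shuffle(nodes)  # visit vertices in random order each round
        changed = False
        for v in nodes:
            if not adjacency[v]:
                continue
            counts = Counter(labels[u] for u in adjacency[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[v] in candidates:
                continue  # already in a majority community
            labels[v] = rng.choice(candidates)  # ties broken uniformly at random
            changed = True
        if not changed:
            break
    return labels


graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
communities = label_propagation(graph)
print(communities)
```

Labels only travel along edges, so vertices in different connected components can never end up in the same community; within an ambiguous component, densely linked groups tend to settle on a shared label.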
12. Results
Detected: 1,496 semantic equivalence classes with 9,152 words
Graph: 9,926 items (words), threshold t = 3
Connected components
3,289 items (1,004 unambiguous components) + 6,637 items (ambiguous components)
Label propagation (clustering) (1st execution) (6,637 items)
5,831 items (487 unambiguous clusters) + 806 items (ambiguous clusters)
Label propagation (clustering) (2nd execution) (806 items)
32 items (5 unambiguous clusters) + 774 items (1 ambiguous cluster)
Examples
aumento: aumento, elevacion, alza, ascenso, incremento
boca: boca, bucal, bucales, oral, orales, yugal
conyugue: conyugue, conyuge, esposa, esposo, marido, pareja, novia, matrimonial, maritale
cutaneo: cutaneo, cutanea, dermatologica, dermica, dermico, piel, peil
fractura: fractura, fx, fc, fratura, fracura, fract, fr
fumador: fumador, fumadora, tabaco, tabaquismo, tabaquista, tqb
infantil: infantil, pediatrico, pedriatica
izquierda: izquierda, izq, izquierdo, izquierda, izqdo, izqda, izda
paciente: paciente, pac, pact, pacte, pte, pcte
postoperatorio: postoperatorio, postquirurgico, postqx, posqx, postop, posop, pop
quimioterapia: quimioterapia, qmt, qt, quimio, pqt
sindrome: sindrome, sme, enfermedad, sd, sind, enf, cuadro, sind, sindorme, sdme, sdr, sde
traumatismo: traumatismo, trauma, tx, trauamtismo, trumatismo, trauma, golpe, tmo
13. Classification experiments
Unigram word features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S2) d1, d2 | binary | 0.59 | 0.59
(S3) d1, d2 | tf-idf | 0.59 | 0.58
(S6) d12, d21 | binary | 0.63 | 0.62
(S8) d12, d21, c | binary | 0.76 | 0.62
Bigram character features
Features | Weight | MaxEnt (F1) | XGBoost (F1)
(S4) d1, d2 | freq. | 0.57 | 0.76
(S5) d1, d2 | tf-idf | 0.56 | 0.74
(S7) d12, d21 | freq. | 0.58 | 0.76
(S9) d12, d21, c | freq. | 0.72 | 0.77
14. Classification experiments
Top-1 result
Model | Prec | Rec | F1
IR | 0.73 | 0.76 | 0.74
MaxEnt (S1) | 0.66 | 0.74 | 0.70
XGBoost (S1) | 0.65 | 0.70 | 0.68
MaxEnt (S8) | 0.74 | 0.78 | 0.76
XGBoost (S9) | 0.75 | 0.79 | 0.77
MaxEnt (S1, S8, S10) | 0.87 | 0.91 | 0.89
XGBoost (S1, S9, S10) | 0.87 | 0.91 | 0.89
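F1 is the harmonic mean of precision and recall, so the last column can be recomputed from the other two. The sketch below does this for the reported rows; the dictionary layout is an assumption made for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1) as given in the table above.
reported = {
    "IR": (0.73, 0.76, 0.74),
    "MaxEnt (S1)": (0.66, 0.74, 0.70),
    "XGBoost (S1)": (0.65, 0.70, 0.68),
    "MaxEnt (S8)": (0.74, 0.78, 0.76),
    "XGBoost (S9)": (0.75, 0.79, 0.77),
    "MaxEnt (S1, S8, S10)": (0.87, 0.91, 0.89),
    "XGBoost (S1, S9, S10)": (0.87, 0.91, 0.89),
}

for model, (p, r, reported_f1) in reported.items():
    print(f"{model}: recomputed {f1(p, r):.2f}, reported {reported_f1:.2f}")
```

All rows agree to two decimals except XGBoost (S1), which recomputes to 0.67 against the reported 0.68; a plausible explanation is that the reported F1 was computed from unrounded precision and recall.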
15. Conclusions & Future work
Outperforms Soft-TFIDF (baseline)
Does not require lexical knowledge (acronyms, abbreviations, synonyms) or spell checkers; this knowledge is acquired from examples
Unsupervised learning of synonyms, abbreviations and typos improves the results obtained through string similarity features
No Spanish-specific resource is required, so our approach can be replicated in any language
Possible to use query expansion techniques