Envisioning a world where everyone helps solve disease (mhaendel)
Keynote presented at the Semantic Web for Life Sciences conference in Cambridge, UK, December 9th, 2015
http://www.swat4ls.org/
The talk focuses on the use of ontologies for data integration to support rare disease diagnostics, and on how many people, unbeknownst to the patient or even to the researchers creating the data, are involved in a diagnosis.
Making the most of phenotypes in ontology-based biomedical knowledge discovery (Michel Dumontier)
A phenotype is an observable characteristic of an individual and typically pertains to its morphology, function, and behavior. Phenotypes, whether observed at the bench or the bedside, are increasingly being used to gain insight into the diagnosis, mechanism, and treatment of disease. A key aspect of these approaches involves comparing phenotypes that are defined in multiple terminologies that often cater to altogether different organisms, such as mice and humans. In this seminar, I will discuss computational approaches for harmonizing and utilizing phenotypes for translational research. We will examine case studies that involve the computation of semantic similarity, including the use of phenotypes to inform clinical diagnosis of rare diseases, to identify human drug targets using mouse knockout models, and to explore phenotype-based approaches for drug repositioning.
The Monarch Initiative: From Model Organism to Precision Medicine (mhaendel)
NIH BD2K all-hands meeting poster November 12, 2015.
Attempts at correlating phenotypic aspects of disease with causal genetic influences are often confounded by the challenges of interpreting diverse data distributed across numerous resources. New approaches to data modeling, integration, tooling, and community practices are needed to make efficient use of these data. The Monarch Initiative is an international consortium working on the development of shared data, tools, and standards to enable direct translation of integrated genotype, phenotype, and environmental data from human and model organisms to enhance our understanding of human disease. We utilize sophisticated semantic mapping techniques across a diverse set of standardized ontologies to deeply integrate data across species, sources, and modalities. Using phenotype similarity matching algorithms across these data enables disorder prediction, variant prioritization, and patient matching against known diseases and model organisms. These similarity algorithms form the core of several innovative tools. The Exomiser enables exome variant prioritization by combining pathogenicity, frequency, inheritance, protein interaction, and cross-species phenotype data. Our Phenotype Sufficiency tool gives clinicians the ability to compare patient phenotypic profiles using the Human Phenotype Ontology to determine uniqueness and specificity in support of variant prioritization. The PhenoGrid visualization widget illustrates phenotype similarity between patients, known diseases, and model organisms. Monarch develops models in collaboration with the community in support of the burgeoning genotype-phenotype disease research community. We have successfully used Exomiser to solve a number of undiagnosed patient cases in collaboration with the NIH Undiagnosed Diseases Program.
Ongoing development in coordination with the Global Alliance for Genomics and Health (GA4GH) and other groups will catalyze the realization of our goal of a vital translational community focused on the collaborative application of integrated genotype, phenotype, and environmental data to human disease.
This talk gives an introduction to entity linking for biomedical data. It describes the problem to be solved as a three stage task and links to state of the art approaches for these steps.
Talk held at the Hamburg Data Science Meetup, Hamburg's largest data event.
Usage of open source software for Real World Data Analysis in pharmaceutical ... (Kees van Bochove)
An upcoming area of interest for biopharmaceutical product development, as well as for public health and healthcare system evaluation, is the study of medical outcomes in so-called 'real world data'. This data can originate from electronic medical records in hospitals, general practitioners, pharmacies, insurance companies and even directly from patients, using forums or mobile health apps.
One of the largest open source initiatives for the standardisation and analysis of this type of data is called OHDSI: Observational Health Data Sciences and Informatics. OHDSI leverages the OMOP data model for observational data, and provides data analysis tools for a broad range of use cases. This talk will focus on a number of examples of the application of the OHDSI tooling for observational research, as well as provide a broader introduction to the topic and the use of open source software in a pharmaceutical and healthcare context.
The presenter, Kees van Bochove, is founder and CEO of The Hyve, a company based in Utrecht, The Netherlands and Cambridge, MA, US that provides services around open source software in bioinformatics and translational research, such as OHDSI, tranSMART and cBioPortal.
Data Visualization in Biomedical Sciences: More than Meets the Eye (Nils Gehlenborg)
In science, data visualization serves two primary purposes. The first is to explore data sets interactively and the second is to communicate discoveries. However, the requirements for visualizations employed in these activities are very different. Therefore, the software tools used for these purposes are typically disconnected, creating significant challenges for reproducibility and effective communication of discoveries in data-driven biomedical science. In this presentation, I will address how a new approach to creating data visualization tools can connect data analysts and other stakeholders inside and outside the scientific community. I will introduce and demonstrate the "Vistories" approach that was motivated by these questions.
Presented at the 5th Cancer Research UK Big Data Analytics Conference on Data Visualization.
Slides contain information about why bioinformatics appeared, who bioinformaticians are, what they do, and what kinds of cool applications and challenges exist in bioinformatics.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
MSeqDR consortium: a grass-roots effort to establish a global resource aimed ... (Human Variome Project)
The success of whole exome sequencing (WES) for highly heterogeneous disorders, such as mitochondrial disease, is limited by substantial technical and bioinformatics challenges to correctly identify and prioritize the extensive number of sequence variants present in each patient. The likelihood of success can be greatly improved if a large cohort of patient data is assembled in which sequence variants can be systematically analysed, annotated, and interpreted relative to known phenotype. This effort has engaged and united more than 100 international mitochondrial clinicians, researchers, and bioinformaticians in the Mitochondrial Disease Sequence Data Resource (MSeqDR) consortium that formed in June 2012 to identify and prioritize the specific WES data analysis needs of the global mitochondrial disease community. Through regular web-based meetings, we have familiarized ourselves with existing strengths and gaps facing integration of MSeqDR with public resources, as well as the major practical, technical, and ethical challenges that must be overcome to create a sustainable data resource. We have now moved forward toward our common goal by establishing a central data resource (http://mseqdr.org/) that has both public access and secure web-based features that allow the coherent compilation, organization, annotation, and analysis of WES and mtDNA genome data sets generated in both clinical- and research-based settings of suspected mitochondrial disease patients. The most important aims of the MSeqDR consortium are summarized in the MSeqDR portal within the Consortium overview sections. Consortium participants are organized in 3 working groups that include (1) Technology and Bioinformatics; (2) Phenotyping, databasing, IRB concerns and access; and (3) Mitochondrial DNA specific concerns. The online MSeqDR resource is organized into discrete sections to facilitate data deposition and common reannotation, data visualization, data set mining, and access management. 
With the support of the United Mitochondrial Disease Foundation (UMDF) and the NINDS/NICHD U54-supported North American Mitochondrial Disease Consortium (NAMDC), the MSeqDR prototype has been built. Current major components include common data upload and reannotation using a novel HBCR based annotation tool that has also been made publicly available through the website; MSeqDR GBrowse, which allows ready visualization of all public and MSeqDR-specific data, including lab-specific aggregate data visualization tracks; an MSeqDR-LSDB instance of nearly 1250 mitochondrial disease and mitochondrially localized genes that is based on the Locus Specific Database model; exome data set mining in individuals or families using the GEM.app tool; and Account & Access Management. Within MSeqDR GBrowse it is now possible to explore data derived from MitoMap, HmtDB, ClinVar, UCSC-NumtS, ENCODE, 1000 Genomes, and many other resources that bioinformaticians recruited to the project are organizing.
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... (2023240532)
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
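Cronbach's alpha from the outline above can be computed directly from the item scores of a multi-item scale. A minimal sketch in plain Python; the respondent data below is a hypothetical illustration, not from the slides:

```python
def cronbach_alpha(items):
    # items: one list of respondent scores per questionnaire item
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance (ddof = 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # total score per respondent across all items
    total_scores = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(col) for col in items) / var(total_scores))

# Hypothetical 5 respondents x 3 Likert-style items
items = [[1, 2, 3, 4, 5],
         [1, 2, 3, 4, 5],
         [3, 3, 3, 3, 3]]
print(round(cronbach_alpha(items), 2))  # 0.75
```

A common rule of thumb treats alpha above roughly 0.7 as acceptable internal consistency, though the threshold depends on the research context.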
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand and the evolution of supply to be shaped by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
2. Presentation outline
o Medical Ontologies
o Linked Open Data
o Dataset generation
o Data Augmentation
o Text based classification
o Classification model
o Embeddings
o eXtreme scale classification
3. About 80% of Electronic Health Records are in unstructured format
Need for NLP tools for processing clinical text
Lack of multilingual terminology resources and domain-specific ontologies
The automatic processing and knowledge extraction from medical records is a task with public importance
4. Clinical text
HISTORY OF PRESENT ILLNESS :The patient is an 80 female with a history of diastolic function and heart failure , hypertension and rheumatoid arthritis who presents from an outside hospital with presyncope.
5. Clinical text
OPERATIONS / PROCEDURES :Dobutamine stress test , cardiac ultrasound , EGD , chest x-ray , PICC placement .The patient is a 62-year-old female with a history of diabetes mellitus , hypertension , COPD , hypercholesterolemia , depression and CHF
6. Clinical text
HISTORY OF PRESENT ILLNESS :The patient is a 63 year-old woman transferred for evaluation of thrombotic thrombocytopenic purpura and bronchiolitis obliterans organizing pneumonia .
7. Why is the task of concept normalization so important?
o Disambiguation
o Usage of URI
o Data integration
o Reasoning
o Similarity search
o Phenotypes
14. How to find training data?
o For 150,000 classes we would need a huge training dataset
o Clinical data are not publicly available due to GDPR issues
o There are only a few manually annotated datasets
o We need to rely only on publicly available sources:
− Other standard classifications and ontologies
− Open data
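The bullets above suggest a bootstrap step: derive (text, code) training pairs from an ontology's term labels and synonyms instead of from protected clinical text. The tiny ontology below is a hypothetical stand-in for a real classification such as SNOMED CT:

```python
# Hypothetical mini-ontology: code -> preferred label + synonyms
ontology = {
    "38341003": {"label": "hypertension",
                 "synonyms": ["high blood pressure", "HTN"]},
    "73211009": {"label": "diabetes mellitus",
                 "synonyms": ["DM"]},
}

def training_pairs(onto):
    """Turn every label and synonym into a (text, code) training example."""
    pairs = []
    for code, term in onto.items():
        for text in [term["label"]] + term["synonyms"]:
            pairs.append((text.lower(), code))
    return pairs

pairs = training_pairs(ontology)
print(len(pairs))  # 5
```

In practice such pairs are noisy and much shorter than real clinical sentences, which is part of why the augmentation techniques discussed later are needed.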
23. Presentation outline
30. Presentation outline
31. Medical Ontology Mappings
o 1:1
o 1:N
o N:M
o No mappings
Source: https://library.ahima.org/doc?oid=106975#.YKOy_agzaHu
32. ExaMode dataset
Dataset version 1
• Summary:
– 22M+ data records
– 128K+ SNOMED codes
– 280K+ textual descriptions
– 17K+ undiscovered connections
33. Dataset Generation
o More data – more problems
o Data cleaning
o Unbalanced dataset
o Overrepresented vs underrepresented classes
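One common remedy for overrepresented vs underrepresented classes is inverse-frequency class weighting, so rare classes contribute more to the loss. A minimal sketch with hypothetical labels:

```python
from collections import Counter

# Hypothetical unbalanced label column (stand-ins for SNOMED-style classes)
labels = ["A", "A", "A", "A", "B", "B", "C"]

counts = Counter(labels)
n = len(labels)

# Inverse-frequency weights, normalized so a perfectly balanced
# dataset would give every class weight 1.0
weights = {c: n / (len(counts) * k) for c, k in counts.items()}

print(weights["C"] > weights["B"] > weights["A"])  # True
```

The resulting dictionary can be passed as per-class loss weights during training; undersampling or oversampling are alternative remedies with the same goal.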
34. Presentation outline
35. Data Augmentation
o The idea originated in dataset enlargement
− Image datasets for neural network training
o Popular techniques:
− Flip
− Rotation
36. Data Augmentation
o Popular techniques:
− Scale
− Crop
− Translate
− Pixel/Region change (fill with constant)
− Pixel/Region swap
− …
37. Types of data augmentation applicable to textual data
o Swap random letters within a single word
o Swap random words within a text
o Replace a word with its synonym
o Delete a random letter within a single word
o Replace a random letter with a letter close to it on the keyboard
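The listed techniques are straightforward to sketch in Python. The functions below are illustrative; the synonym map and the keyboard-neighbor map are hypothetical toy examples:

```python
import random

rng = random.Random(0)  # seeded for reproducible augmentation

def swap_letters(word):
    """Swap two adjacent random letters within a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def swap_words(text):
    """Swap two random words within a text."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def replace_synonym(text, synonyms):
    """Replace words by their synonyms (synonyms is a toy lookup table)."""
    return " ".join(synonyms.get(w, w) for w in text.split())

def delete_letter(word):
    """Delete one random letter within a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def keyboard_typo(word, neighbors):
    """Replace a random letter with a (hypothetical) keyboard neighbor."""
    i = rng.randrange(len(word))
    return word[:i] + neighbors.get(word[i], word[i]) + word[i + 1:]

print(replace_synonym("chest pain", {"pain": "ache"}))  # chest ache
```

Each function returns a slightly perturbed copy, so applying them repeatedly to the ontology-derived texts multiplies the effective training set size.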
38. ExaMode dataset
Dataset version 2: remove noise
• Additional data augmentations
• Additional heuristics
• Additional data cleaning
• Split the dataset into 3 subgroups:
– Disorders
– Procedures
– Findings
40. Presentation outline
42. Binary classification
o Each sample takes exactly 1 label out of 2 classes
Review Sentiment
Delivered as expected Positive
Good quality Positive
There are scratches on the surface Negative
Works great Positive
I do not recommend it Negative
43. Multiclass classification
o Each sample takes exactly 1 label out of a number of classes
Movie Rating
Palmer 7
Bad Trip 6
Godzilla vs. Kong 6
Band of Brothers 9
Big fish 8
44. Multilabel classification
o Each sample takes one or more labels out of the total number of classes
Movie Drama Comedy Action Sci-Fi War Adventure Fantasy
Palmer 1 0 0 0 0 0 0
Bad Trip 0 1 0 0 0 0 0
Godzilla vs. Kong 0 0 1 1 0 0 0
Band of Brothers 1 0 1 0 1 0 0
Big fish 1 0 0 0 0 1 1
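The three settings differ only in how targets are encoded: one of two labels, one of many, or a 0/1 vector over all classes. A minimal sketch of the multilabel encoding, reproducing a row of the genre table above:

```python
# Fixed class order, matching the table columns
genres = ["Drama", "Comedy", "Action", "Sci-Fi", "War", "Adventure", "Fantasy"]

def to_multilabel(sample_genres):
    """Encode a set of genres as a 0/1 indicator vector over all classes."""
    return [1 if g in sample_genres else 0 for g in genres]

print(to_multilabel({"Drama", "Action", "War"}))  # [1, 0, 1, 0, 1, 0, 0]  (Band of Brothers)
```

SNOMED code assignment to clinical text is exactly this multilabel case, except with around 150,000 classes instead of 7, which is what makes it an eXtreme scale classification problem.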
45. Presentation outline
46. Classification model
o BERT (Bidirectional Encoder Representations from
Transformers)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
48. Classification model
o BERT core idea
Source: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
50. Classification model
o BERT advantages
o Incredible performance
o Open source
o Easy to pretrain further with a small amount of medical data
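Fine-tuning BERT for classification amounts to adding a small softmax head on top of the pooled [CLS] embedding. A minimal sketch of just that head, with a fixed toy embedding standing in for the encoder output; all numbers are hypothetical, and in practice the head and encoder are trained jointly (e.g. via the Hugging Face Transformers library):

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of logits."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def classify(cls_embedding, W, b):
    """One logit per class: z_c = w_c . h + b_c, then softmax."""
    logits = [sum(w * h for w, h in zip(row, cls_embedding)) + bc
              for row, bc in zip(W, b)]
    return softmax(logits)

h = [0.2, -0.1, 0.4]                    # toy pooled [CLS] embedding
W = [[1.0, 0.0, 0.0],                   # 2 classes x 3 embedding dims
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0]

probs = classify(h, W, b)
print(probs[1] > probs[0])  # True
```

Swapping in a domain-specific pretrained encoder (BioBERT, ClinicalBERT, etc.) changes only where `h` comes from; the classification head stays the same.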
51. Classification model
o BERT pretrained models:
o bioBERT
o multilingualBERT
o slavicBERT
o clinicalBERT
o pubmedBERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
Mikhail Arkhipov, Maria Trofimova, Yurii Kuratov, and Alexey Sorokin. Tuning multilingual transformers for language-specific named entity recognition. 2019.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. In ClinicalNLP workshop at NAACL, 2019.
Yu Gu, et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
52. Presentation outline
53. Embeddings
o Student: [2, 7]
o School: [3, 6]
o University: [1, 5]
o Dog: [6, 2.5]
o Cat: [5, 2]
o Fish: [7.5, 1]
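With the toy 2-D embeddings above, semantic closeness becomes measurable, for example with cosine similarity:

```python
import math

# The toy 2-D embeddings from the slide
emb = {"student": [2, 7], "school": [3, 6], "university": [1, 5],
       "dog": [6, 2.5], "cat": [5, 2], "fish": [7.5, 1]}

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Related words point in similar directions:
print(cosine(emb["student"], emb["school"]) > cosine(emb["student"], emb["dog"]))  # True
```

Real deep-learning embeddings work the same way, just with hundreds of dimensions learned from text instead of two hand-picked ones.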
54. Embeddings
o Deep learning embeddings
Figure is based on: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
55. Presentation outline
62. Acknowledgements
o Alexander Tahchiev
o Andrey Avramov
o Hristo Papazov
o Pavlin Gyurov
o Todor Primov
o Stanislav Slavkov
https://www.datasciencesociety.net/
https://www.ontotext.com
63. Thank you!
See Ontotext Platform demos
Star Wars API: https://swapi-platform.ontotext.com/graphiql/
Platform monitoring: https://test-platform.ontotext.com/grafana/