There are ongoing large-scale efforts to catalog genomic variation related to disease in structured databases. Much of the relevant information is available only from unstructured sources, including the scientific literature. In our work, we have explored the ability of text mining tools to recover the mutations catalogued in curated databases based on the article text, specifically examining the recovery of mutations in the COSMIC and InSiGHT databases. We demonstrate that there are excellent tools for extraction of mutation mentions from the literature, but that the recovery of the information in databases is far less than what would be expected based on that tool performance, even when full text articles are available. I will present an analysis in which we explore the impact of processing tables and supplementary material associated with relevant literature, demonstrating that the coverage of variants improves dramatically, from 2% to over 50%. I will further present the Variome corpus, a small collection of full text publications annotated with relationships such as gene-disease and mutation-disease relationships, and introduce our recent efforts to develop strategies to extract this relational information from the literature. Joint work with Antonio Jimeno Yepes (IBM Research) and Min Song (Yonsei University).
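The 2% and 50% coverage figures are, in effect, recall of the curated database variants against the text-mined mentions. A minimal sketch of that computation (the variant identifiers below are illustrative, not taken from COSMIC or InSiGHT):

```python
def variant_recall(db_variants, mined_variants):
    """Fraction of curated database variants recovered by text mining,
    assuming both sides are normalised to a shared notation (e.g. HGVS)."""
    return len(db_variants & mined_variants) / len(db_variants)

# Toy example: four curated variants, mining recovers two of them.
curated = {"p.V600E", "p.G12D", "c.35G>A", "p.R175H"}
mined = {"p.V600E", "p.R175H", "p.Q61K"}
print(variant_recall(curated, mined))  # 0.5
```

In practice the normalisation step dominates the difficulty: the same variant may appear in protein, cDNA, or ad hoc notation across articles, tables, and supplementary files.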
Medical Information Retrieval Workshop Keynote (MedIR@SIGIR2014) - Karin Verspoor
A 68-year-old female presented to the emergency department with shortness of breath that had progressively worsened over 4-5 days. After admission she experienced intermittent severe shortness of breath. Testing found a positive D-dimer; chest imaging could not be done, and although a nuclear scan was ordered, the patient was too unstable. Given the positive D-dimer and the severity of symptoms, the patient was treated with anticoagulants due to concern for a possible pulmonary embolism and admitted under a physician for further care.
Crowds Cure Cancer: Annotating Data from The Cancer Imaging Archive - CancerImagingInforma
The document discusses a crowdsourcing experiment conducted at RSNA 2018 to collect cancer image annotations from over 250 participants. Preliminary results found promising agreement between crowd annotations and expert ground truths. Feedback from participants suggested improvements to the annotation interface such as tutorials, measurement tools, and metrics display. Overall the experiment demonstrated the potential for crowdsourcing to efficiently generate large annotated medical image datasets.
This document discusses protein identification from mass spectrometry data. It describes how tandem mass spectrometry is used to break proteins into peptides and peptides into fragment ions. The fragment ion masses can then be used to reconstruct the peptide sequence de novo or search protein databases to identify the source protein. Key algorithms discussed include de novo sequencing, database search tools like Sequest and Mascot, and techniques like InsPecT that can rapidly search large databases or analyze many spectra.
This document provides an overview of protein identification using mass spectrometry. It discusses how tandem mass spectrometry is used to break proteins into peptides and peptides into fragment ions. The fragment ion masses are then used to reconstruct peptides through de novo sequencing or database searching against a protein database. The document compares de novo sequencing, which reconstructs peptides from fragment ion masses, to database searching, which matches experimental spectra to theoretical spectra from a database.
This document summarizes Christopher Mason's presentation on epigenetics quality control and single-cell RNA-seq variant calling using samples from the Genome in a Bottle project. It discusses generating reference epigenetics datasets, including whole genome bisulfite sequencing data, Illumina 450K methylation array data, and targeted bisulfite sequencing data for several GIAB samples. Parameters for variant calling from single-cell RNA-seq data are evaluated, finding best sensitivity and specificity at 97% and 80% respectively using certain settings. The work aims to establish high quality epigenetics and variant calling references to help benchmark computational methods for personalized medicine.
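For reference, the quoted sensitivity and specificity are the standard confusion-matrix ratios; a minimal sketch with hypothetical counts chosen to reproduce the 97%/80% operating point:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (true positive rate) = TP / (TP + FN);
    specificity (true negative rate) = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts, not from the presentation itself.
sens, spec = sensitivity_specificity(tp=97, fn=3, tn=80, fp=20)
print(sens, spec)  # 0.97 0.8
```

For variant calling, a "positive" is a site called as a variant; the trade-off between the two rates is what the evaluated parameter settings tune.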
Basics of Data Analysis in Bioinformatics - Elena Sügis
This presentation gives an introduction to the basics of data analysis in bioinformatics.
The following topics are covered:
Data acquisition
Data summary (selecting the needed columns/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
Cluster annotations
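As a minimal sketch, the preprocessing, PCA, and clustering steps listed above might look like this on synthetic data (numpy only; the toy expression matrix, group shift, and all parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: two groups of 10 samples x 5 features,
# with the second group shifted so that clusters are recoverable.
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(4, 1, (10, 5))])

# Preprocessing: z-score each feature (centre and scale).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are principal axes; keep the first two scores.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pcs = Z @ Vt[:2].T

# k-means (Lloyd's algorithm), k=2, seeded with two far-apart samples.
centroids = pcs[[0, -1]]
for _ in range(20):
    labels = np.argmin(((pcs[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    centroids = np.array([pcs[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # cluster assignment per sample
```

Missing-value imputation would precede the z-scoring step; cluster annotation then proceeds by inspecting which samples or features characterise each cluster.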
Jake Lever - University of Glasgow
Will artificial intelligence change how readers use the research literature?
Huge advances in machine learning and natural language processing are set to upend how researchers search and consume research articles, as well as change how articles are written. These new approaches are becoming adept at summarising and rewriting text, answering questions about it, and extracting key information. Such abilities will enable humans to search for information in new ways, as systems like ChatGPT already demonstrate. They are valuable tools for researchers who curate the research literature to build knowledge bases, particularly in biomedicine. Nevertheless, these approaches suffer from serious problems, including their computational cost and their tendency to confidently output incorrect information. This session will provide background on how these new methods work and discuss their benefits, challenges and potential impact.
Metabolomics in the 21st century - perspective - Dinesh Barupal
This document discusses metabolomics as a service (MaaS) and the role of metabolomics core facilities in providing metabolomics data generation and analysis services. It outlines the key aspects of metabolomics including measuring small molecule chemicals in biological samples, the role of metabolism in health and disease, and approaches for blood metabolomics. The document advocates that metabolomics core facilities can provide cost-effective, high-quality metabolomics services including standardized assays, large-scale data generation, robust data analysis, and developing metabolite libraries. This would support metabolic epidemiology and clinical research projects.
Learn how to use Pathway Studio to explore biomarkers and brain regions. With the addition of highly sophisticated visualization tools, users can interactively explore the vast number of connections created to help unravel disease biology. In addition, an innovative new taxonomy based on brain region identifications will be presented. Together, these innovations can be applied to rapidly increase the knowledge of diseases based on published findings.
Genetic architecture of developmental traits in populations of male gypsy moths - cfriedline
This document summarizes a study on the genetic architecture of developmental traits in gypsy moth populations. The study established 7 gypsy moth populations in common gardens and sequenced 188 individuals to identify 11,021 SNPs. Three phenotypes - pupal duration, mass, and total development time - were measured. Population structure was corrected using PCA. Several SNPs were significantly associated with each trait, though effect sizes were small. Multilocus models explained over 50% of trait variation. Future work could involve refining the genome assembly and studying additional populations to detect smaller genetic effects.
François Fauteux, National Research Council Canada, Emerging strategies for computational ADC target selection and prioritization, World ADC 2017, San Diego
Flow Cytometry Training talks - part 1
This forms the first session of the Garvan Flow Cytometry Training course, a 1.5-day course aimed at giving new and experienced researchers a better understanding of cytometry in medical and biological research.
Abstract: Ontologies are used in numerous research disciplines and commercial applications to uniformly and semantically annotate real-world objects. Due to the rapid development of application domains, the corresponding ontologies change frequently to include up-to-date knowledge. These changes strongly influence dependent data as well as applications and systems, for instance ontology mappings that semantically interrelate ontologies. The talk will give an overview of the evolution of ontologies and ontology-based mappings.
Bioinformatics issues and challenges presentation at S P College - SKUASTKashmir
This document provides an overview of bioinformatics and some key concepts:
- It discusses the exponential growth of biological data from technologies like PCR and microarrays, and how bioinformatics is needed to analyze this data.
- Bioinformatics is defined as integrating biology and computer science to collect, analyze, and interpret large amounts of molecular-level information. It uses databases and tools to study genomes, proteins, and biological processes.
- Major databases like GenBank, EMBL, and SwissProt store DNA, RNA, protein sequences and provide access to researchers. Tools like BLAST are used to search databases and analyze sequences.
- Benefits of bioinformatics include advances in medicine, agriculture, forensics
Using biological network approaches for dynamic extension of micronutrient re... - Chris Evelo
This document discusses using biological network approaches to dynamically extend pathways with regulatory information such as microRNAs (miRNAs). It describes tools like PathVisio that can integrate gene expression, proteomics and metabolomics data onto pathways to identify significantly changed processes. WikiPathways is introduced as a public pathway resource that can be contributed to and curated by researchers. The document outlines approaches for visualizing regulatory interactions on pathways using plugins, exploring pathway interactions through network analysis, and integrating other data types such as SNPs, fluxes and gene annotations to build a more comprehensive understanding of biological systems.
Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, and the devices in use continue to expand, as does community participation, and the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community, including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves tens of thousands of chemists every day. The platform supports crowdsourcing, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database, and we are presently architecting the RSC Data Repository, a platform that will extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
This document provides information about a training workshop on using Ensembl. It includes an agenda for the day-long workshop covering topics like introduction to Ensembl, browsing genes and data, using BioMart, and exploring genetic variation data. The workshop materials are available under a CC BY license and the document encourages attendees to cite any Ensembl papers if using the resource for their own work. Break times and locations are also listed on the agenda.
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org - Andrew Su
Crowdsourcing approaches such as wikis and games are proposed to help address the problem of sparse gene annotation. The Gene Wiki allows collaborative annotation and contains over 1 million words of content. Games like Dizeez gather human input on gene-disease relationships through gameplay. Crowdsourcing shows promise for efficiently gathering large amounts of data to help complete the annotation of genes and better understand relationships between genes and diseases.
Presentation on pathway extensions using knowledge integration and network approaches, presented at the Systems Biology Institute in Luxembourg on November 28, 2012.
Leveraging Text Classification Strategies for Clinical and Public Health Appl... - Karin Verspoor
Human-generated text is a critical component of recorded clinical data, yet remains an under-utilised resource in clinical informatics applications due to minimal standards for sharing of unstructured data as well as concerns about patient privacy. Where we can access and analyse clinical text, we find that it provides a hugely valuable resource. In this talk, I will describe two projects where we have used text classification as the basis for addressing a clinical objective: (1) a syndromic surveillance project where the task is the monitoring of health and social media data sources for changes that indicate the onset of disease outbreaks, and (2) the analysis of hospital records to enable retrieval of specific disease cases, for monitoring of the hospital case mix as well as for construction of patient cohorts for clinical research studies. I will end by briefly discussing the huge potential for clinical text analysis to support changing the way modern medicine is practised.
Text Mining for Biocuration of Bacterial Infectious Diseases - Dan Sullivan, Ph.D.
Specialty gene sets, such as virulence factors and antibiotic resistance genes, are of particular interest to infectious disease researchers. Much of the information about specialty genes’ function is described in the literature but unavailable as structured data in bioinformatics databases. The steadily increasing volume of literature makes it difficult to manually find relevant papers and extract assertion sentences about specialty genes. This presentation describes efforts to build an automatic classifier for such sentences. Experiments were conducted to assess the impact of the imbalance of positive and negative examples in source documents on classification; to develop a support vector machine (SVM) classifier using a term frequency-inverse document frequency (TF-IDF) representation of text; and to assess the marginal benefit of additional training examples on the quality of the classifier. Analysis of learning curves indicates that additional training examples will not likely improve the quality of the classifier. We discuss options for other text representation schemes to investigate in order to improve the quality of the classifier as measured by F-score.
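To make the text-representation step concrete, here is a from-scratch TF-IDF sketch on toy sentences. Note it pairs the TF-IDF vectors with a simple nearest-centroid rule rather than the SVM used in the presented work, and every sentence and term below is invented for illustration:

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency over tokenised training docs."""
    n, df = len(docs), Counter()
    for doc in docs:
        df.update(set(doc))
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}

def tfidf(doc, idf):
    """Sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * idf[t] for t in tf if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Toy corpus: assertion sentences about specialty genes vs. other sentences.
positive = ["gene confers antibiotic resistance".split(),
            "resistance gene identified in strain".split()]
negative = ["protocol for cell culture medium".split(),
            "buffer preparation and culture steps".split()]
idf = fit_idf(positive + negative)

def centroid(docs):
    out = {}
    for d in docs:
        for t, w in tfidf(d, idf).items():
            out[t] = out.get(t, 0.0) + w / len(docs)
    return out

pos_c, neg_c = centroid(positive), centroid(negative)

def classify(sentence):
    v = tfidf(sentence.split(), idf)
    return "specialty" if cosine(v, pos_c) > cosine(v, neg_c) else "other"

print(classify("novel antibiotic resistance gene"))  # specialty
```

The class imbalance issue raised in the abstract shows up here as skew in how the centroids (or, in the real system, the SVM decision boundary) are estimated when one class dominates the training data.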
Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology. It involves the electronic storage, retrieval, analysis, and correlation of biological data. The document outlines key concepts in bioinformatics including the central dogma of molecular biology, biological data representation, how computers can be useful for biology, challenges in the field, and examples of intelligent bioinformatics applications. It emphasizes that bioinformatics is an important and growing field at the intersection of biology and computer science.
pro-iBiosphere Towards Open Biodiversity Knowledge COOPEUS 2013 - millerjeremya
The document discusses the goals and approaches of the Pro-iBiosphere project, which aims to make taxonomic data more accessible and interoperable by linking literature to datasets. It outlines challenges around technical and semantic interoperability of taxonomic data. It also describes the prospective approach of extracting structured data from publications and distributing it to biodiversity databases, and the retrospective approach of extracting elements from existing literature to populate databases.
Partitioning Heritability using GWAS Summary Statistics with LD Score Regression - bbuliksullivan
1) The document describes a new method for partitioning heritability of complex traits using summary statistics from large GWAS. It uses LD Score Regression to estimate the proportion of heritability associated with different functional annotations of the genome.
2) The method was validated in simulations, where it accurately estimated null and enriched heritability proportions.
3) The method was applied to real GWAS data for 10 complex traits, finding many functional elements enriched including conserved regions, enhancers, and cell-type specific H3K27ac regions, providing new insights into genetic architecture and disease biology.
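The core regression behind the method can be sketched in a few lines. This toy version uses noise-free data generated from the single-annotation model itself, and omits the regression weights, intercept handling, and block jackknife of the real method; the partitioned version fits one slope per annotation category's LD scores:

```python
def ols(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def estimate_h2(chi2, ld_scores, n_samples, m_snps):
    """Under E[chi2_j] = 1 + (N * h2 / M) * l_j, the slope of the chi-square
    statistics on LD scores rescales to a heritability estimate."""
    _, slope = ols(ld_scores, chi2)
    return slope * m_snps / n_samples

# Noise-free toy data generated from the model itself (h2 = 0.5).
N, M, h2_true = 10_000, 1_000, 0.5
ld = [2.0, 5.0, 10.0, 25.0, 60.0, 120.0]
chi2 = [1 + N * h2_true * l / M for l in ld]
print(round(estimate_h2(chi2, ld, N, M), 3))  # 0.5
```

With real summary statistics the intercept also absorbs confounding such as population stratification, which is a key selling point of the approach.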
Accelerate pharmaceutical R&D with MongoDB - MongoDB
This document provides a summary of a presentation on using MongoDB and big data technologies to accelerate pharmaceutical research and development at AstraZeneca. The presentation discusses:
- AstraZeneca's focus on using next generation sequencing and big data to predict drug effectiveness and identify new drug targets
- Pilot projects using MongoDB to store and query unstructured genomic data at scale, which demonstrated the technology's ability to help researchers work more quickly
- A vision for an experiment management system to integrate various data sources and processing pipelines using big data technologies
Using real-world evidence to investigate clinical research questions - Karin Verspoor
Adoption of electronic health records to document extensive clinical information brings with it the opportunity to utilise that information to support clinical research, and ultimately to support clinical decision making. In this talk, I discuss both these opportunities and the challenges that we face when working with real-world clinical data, and introduce some of the strategies that we are adopting to make this data more usable, and to extract more value from it. I specifically discuss the use of natural language processing to transform clinical documentation into structured data for this purpose.
Robogals 10th Anniversary Gala Keynote - Karin Verspoor
Karin is a woman in tech who advocates for diversity and inclusion. She shares statistics showing that gender diversity correlates with profitability and value creation, yet women remain underrepresented in tech fields. To address this, she calls for developing talent pipelines for girls and women, disrupting the status quo through role models and mentoring, and influencing choices by challenging stereotypes. The overall message is that diversity matters and a new approach is needed.
- AstraZeneca's focus on using next generation sequencing and big data to predict drug effectiveness and identify new drug targets
- Pilot projects using MongoDB to store and query unstructured genomic data at scale, which proved the technology's ability to enable researchers more quickly
- A vision for an experiment management system to integrate various data sources and processing pipelines using big data technologies
Similar to Using text mining to inform genetic variant interpretation
Using real-world evidence to investigate clinical research questionsKarin Verspoor
Adoption of electronic health records to document extensive clinical information brings with it the opportunity to utilise that information to support clinical research, and ultimately to support clinical decision making. In this talk, I discuss both these opportunities and the challenges that we face when working with real-world clinical data, and introduce some of the strategies that we are adopting to make this data more usable, and to extract more value from it. I specifically discuss the use of natural language processing to transform clinical documentation into structured data for this purpose.
Robogals 10th Anniversary Gala Keynote, Karin VerspoorKarin Verspoor
Karin is a woman in tech who advocates for diversity and inclusion. She shares statistics showing that gender diversity correlates with profitability and value creation. However, women remain underrepresented in tech fields. To address this, her document calls for developing talent pipelines for girls and women, disrupting status quos through role models and mentoring, and influencing choices by challenging stereotypes. The overall message is that diversity matters and a new approach is needed.
Machine learning -- the use of computational algorithms to find patterns in data -- is increasingly being deployed in clinical contexts to support diagnosis and treatment decisions. In the context of growing volumes of clinical data available in electronic form, there is an opportunity to realise dramatic changes in the practice of medicine through the application of large-scale health data analytics and predictive modeling. This talk will introduce a vision for the use of data-driven methods in health, while also raising important questions about the implementation of this vision: is it conceivable that one day your doctor might be replaced by a digital system? What are the risks?
Function and Phenotype Prediction through Data and Knowledge FusionKarin Verspoor
The biomedical literature captures the most current biomedical knowledge and is a tremendously rich resource for research. With over 24 million publications currently indexed in the US National Library of Medicine’s PubMed index, however, it is becoming increasingly challenging for biomedical researchers to keep up with this literature. Automated strategies for extracting information from it are required. Large-scale processing of the literature enables direct biomedical knowledge discovery. In this presentation, I will introduce the use of text mining techniques to support analysis of biological data sets, and will specifically discuss applications in protein function and phenotype prediction, exploring the integration of literature data with complementary structured resources.
Syndromic Surveillance from Emergency Department Triage NotesKarin Verspoor
Background
Syndromic surveillance refers to reporting and tracking of reportable and unusual diseases to public health officials. Conventional surveillance strategies are often manual, or depend on confirmatory laboratory testing after a disease diagnosis. These traditional strategies often result in relatively late detection of an outbreak or public health emergency. Strategies for reliably accelerating surveillance are under active research.
The aim of our work is detection of specific syndromes in individual patient triage records in the hospital Emergency Department (ED). We focus on analysing the free text clinical notes written by a triage nurse during a brief pre-diagnostic assessment of a patient upon arrival in the ED. The system can detect patients that appear to have a disease of interest.
Methods
We work with a set of over 310,000 records collected in two Victorian EDs over a several-year period. Each patient triage record in our data includes (1) a free text note and (2) a diagnostic code from the International Classification of Disease (ICD-10) that was assigned after the fact. This data was used for training and testing of various classifiers, in a cross-validation scenario. We experimented with a range of different set-ups, including attempting direct prediction of ICD-10 codes for a given triage note, as well as prediction of “syndromes” defined by a specific set of ICD-10 codes. We also experimented with several different feature representations and machine learning models.
Results
In general, the performance of the models for syndromes was better than for direct ICD-10 category classification, suggesting that the syndrome definitions are clinically coherent. We observed substantial variation in performance across the various syndromes; several syndromes had too few examples in the dataset to build an effective classifier. The best performance on these tasks used a machine learning model that incorporates pre-processing of the texts to identify direct mentions of ICD-10 and SNOMED CT terms.
Conclusion
We have demonstrated that it is possible to build an effective syndrome detection tool for ED triage notes, where there is adequate and reliable training data available for a given syndrome of interest. We have shown that semantic abstraction of the text into “medical concept space” is of benefit for this task.
Topic modeling of Emergency Department Triage notes for characterising pain-r...Karin Verspoor
Background
Pain is a feature of approximately 70% of all Emergency Department (ED) presentations. It has been demonstrated that mandating recording of a patient’s feeling of pain can improve service delivery for ED patients. However, there is a substantial group of patients (approximately 21% of ED visits in our 12-month sample) for which there exists an inconsistency between pain score and the Australian Triage Scale (ATS) score assigned by the nurse; where a patient reports high levels of pain but they are assigned a lower-urgency triage category. It has been unclear until now whether this “inconsistent” group of patients has been receiving optimal care.
Methods
To better understand the characteristics in this inconsistent group, we performed topic modeling of the clinical notes collected during ED triage assessments. We divided the notes into two subgroups, according to whether or not the patient’s self-reported level of pain was consistent with the triage urgency recorded in the ATS score. We performed topic modeling of these two subgroups separately, using the implementation of Latent Dirichlet Allocation (LDA) in the Mallet toolkit. We have experimented with several representations of the notes, including unigrams (tokens), bigrams, and the medical concepts contained in each note, as determined with the MetaMap medical concept recognition tool. An ED nurse reviewed the topics generated in each case and assigned a descriptor to them.
Results
When considering the token-based representation of the notes, the labels in the consistent group are related to road trauma, cardiac pain, change of consciousness, ongoing chest pain, limb injury, renal illness and pain due to illness. In the inconsistent group, we find topics related to ongoing conditions (including postoperative complications or worsening abdominal pain), urinary and respiratory problems, infections and injury-related complications.
When considering the concept-based representation of the notes, the labels in the consistent set denote gastrointestinal diseases, neurological illness, dizziness, chest pain, testicular pain, shortness of breath and trauma. The labels in the inconsistent set denote different issues caused by trauma and distress due to pain, infection and urinary condition. This includes injuries in several body parts like in the limbs and back. The latter topic containing body parts appears to have been enabled by the abstraction of individual terms into concepts.
Conclusions
Topic modeling of Emergency Department data shows substantial promise for helping to characterise particular subpopulations of interest, and incorporating pre-processing of clinical notes to capture variation in clinical terminology appears to have value. While this initial work has focused on the pain-related chief complaints, we have also recently begun to explore temporal characteristics of the data through analysis of how derived topics change over time.
Using text mining to inform genetic variant interpretation
1. Using text mining to inform
genetic variant interpretation
Karin Verspoor
Department of Computing and Information Systems
karin.verspoor@unimelb.edu.au
2. So you’re a medical doctor …
• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis
(whole exome or genome)
• And find a genetic mutation
What does it mean?
3. Clinical interpretation of variants
[Diagram: “Sample Data Flow”, spanning Wet Lab, Bioinformatics, and Clinical Informatics stages: Patient Sample → Histology → DNA Extract → PCR → Sequencing → Alignment → Variant Calling → Annotation → DB Load → Filtering → Known Variant? Known variants are published to the report; unknown variants go through Curation into the Peter Mac Mutation DB, alongside external and locus-specific DBs, followed by Variant Normalisation, Document Assembly, Report Editing and Signoff, and the patient Clinical Report. Steps are marked as either manual or automatic.]
Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.
4. What’s a mutation?
• Genomic variation: alteration in a sequence
– hereditary (germ-line) mutations
– acquired (somatic) mutations
• Examples of variation
– SNP (single nucleotide polymorphism)
– Protein mutation
– insertions, deletions, duplications, inversions, . . .
• Types of variations
– DNA variations that have no adverse effects on our cells and
occur frequently in the population are called polymorphisms
– DNA variations that do affect the function of the protein
made from a gene and occur less often are called mutations
5. The Challenge: Interpreting variants
§ Identifying variation is becoming easier,
interpreting it remains difficult
• Which changes are due to normal individual variation?
• Which are associated with a phenotype of interest?
6. Interpreting variation through context
• Analysis of functional significance of variants
– Predicted impact of mutations
– Conservation analysis
– Allele frequencies from large genomic databases
• Existing knowledge captured in structured sources
– UniProt site-specific protein annotations
– The Cancer Gene Atlas genomic characterisation data
– Disease-specific variant databases, e.g. COSMIC and
InSiGHT
• Techniques for annotating variants
– Data aggregation from multiple sources
– Data integration and inference to reveal shared pathways
9. Structured resources are not enough:
Literature is the primary repository of knowledge
[Chart: number of Swiss-Prot proteins (y-axis 0–120,000), 2002–2007, comparing proteins missing a FUNCTION comment with proteins gaining a FUNCTION comment.]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al ISMB 2007
“Our entire understanding of biology and medicine
is really contained in the published literature. And
since people write in natural language, if you can’t
get computers to turn that information into
databases and computable information, you’re
falling behind.”
-- Russ Altman, MD PhD, Stanford University
10. Recovery of variants from the
literature using text mining
Study:
Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying
the importance of supplementary material. Database: The Journal of Biological Databases and
Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]
11. Study: Recall of curated variants through the
application of text mining
• Given a curated resource of genetic variants,
• with explicit links to the source literature for
each variant,
• and a mutation extraction tool with
demonstrated good performance on intrinsic
evaluation
… how many variants can text mining recover?
13. Motivations
• Assess real-world applicability of text mining
tools for supporting analysis of genetic
variants
• Speed up curation of mutation databases
14. Two databases
• InSiGHT, Human Variome Project
– MLH1, MSH2, MSH6 and PMS2 linked to
Lynch syndrome (germline mutations)
• COSMIC, Sanger Institute
– Somatic mutations linked to cancer
Database  PMIDs associated to Mutations  Total Mutation Count  Average Mutations per Article  Std Dev
InSiGHT   809                            7022                  8.68                           18.55
COSMIC    7898                           198864                25.18                          521.18
15. Literature mutation extraction
• Many tools exist to perform mutation annotation
– MutationMiner, MutationFinder, EMU, tmVar, SETH,
...
• Research shows that they have high precision
and recall on MEDLINE abstracts (> 90% F1)
• There are also tools to do named entity
extraction of genes, diseases, body parts …
Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust
recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi:
10.12688/f1000research.3-18.v2 [PMID:25285203]
16. How to extract mutations from text?
• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino
acid residues).
• e.g., MutationFinder1 patterns (simplified):
(?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
Gly17Ser
Ser97Pro
• where AminoAcid is:
(CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|
TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|
ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|
THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|
TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|
TYROSINE)
1http://mutationfinder.sourceforge.net/
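The simplified pattern above can be sketched as a working regular expression (a toy version; the actual MutationFinder tool ships a much larger battery of patterns and abbreviation handling):

```python
import re

# Simplified MutationFinder-style pattern for three-letter point mutations.
AMINO_ACID = (
    r"(?:CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    r"TRP|VAL|GLU|TYR)"
)
POINT_MUTATION = re.compile(
    rf"(?P<wt_res>{AMINO_ACID})(?P<pos>[1-9][0-9]*)(?P<mut_res>{AMINO_ACID})",
    re.IGNORECASE,
)

def find_mutations(text):
    """Return (wild-type residue, position, mutant residue) triples in text."""
    return [
        (m.group("wt_res"), int(m.group("pos")), m.group("mut_res"))
        for m in POINT_MUTATION.finditer(text)
    ]
```

For example, `find_mutations("The Gly17Ser and Ser97Pro substitutions were studied.")` recovers both mentions from the slide's examples.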
18. • Pattern-based approach to identifying genetic
variants
– dbSNP identifiers and standard HGVS nomenclature
(e.g. SETH https://rockt.github.io/SETH)
– natural language expressions of mutations
o This missense mutation converts a highly conserved glycine
(Gly17 of neurophysin) to a valine residue.
o Killer of prune (Kpn) is a mutation in the awd gene which
substitutes Ser for Pro at position 97 and causes dominant
lethality in individuals that do not have a functional prune gene.
o … where cysteines at positions 6, 42, 48, 90 and 393 were
replaced by serine.
Extraction of mutations from text
21. Extraction with EMU over our data
• EMU: extracts mutations from text and
links them to co-occurring genes
• Normalize all mutation mentions to
HGVS format
– Format used in COSMIC and InSiGHT
• Match {gene, HGVS variant, PMID}
to curated data
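The matching step can be sketched as set intersection over (gene, HGVS variant, PMID) triples. This is an illustrative simplification (the real evaluation requires HGVS normalisation first); the relaxed `ignore_gene` mode mirrors the "No Gene" (NG) recall in the results table:

```python
def variant_recall(curated, extracted, ignore_gene=False):
    """Recall of curated (gene, hgvs, pmid) triples recovered by text mining.

    Both arguments are sets of (gene, hgvs, pmid) tuples. With
    ignore_gene=True, matching is relaxed to (hgvs, pmid) pairs,
    corresponding to the "No Gene" (NG) recall measure.
    """
    if ignore_gene:
        curated = {(h, p) for _, h, p in curated}
        extracted = {(h, p) for _, h, p in extracted}
    return len(curated & extracted) / len(curated) if curated else 0.0
```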
22. Results
Abstracts and Full Text
NG = No Gene (ignoring gene in match)
Common/Cmn = PMIDs in common between database and corpus subset (recall with respect to articles for which the mutation entity recogniser had at least one positive extraction)

Set               Cmn art  Match mutation  Recall  Recall NG  Mutations common  Recall common  Recall CmnNG
COSMIC Abs        2200     1884            0.0095  0.0122     12,940            0.1456         0.1875
COSMIC FT         2071     3656            0.0184  0.0215     104,756           0.0349         0.0408
COSMIC Abs + FT   3738     4754            0.0239  0.0289     114,279           0.0416         0.0503
InSiGHT Abs       195      230             0.0328  0.045      1233              0.1865         0.2562
InSiGHT FT        150      404             0.0575  0.0612     1626              0.2484         0.2644
InSiGHT Abs + FT  295      588             0.0837  0.0961     2657              0.2213         0.254
23. High Throughput vs non-High Throughput
Set            Cmn art  Match mutation  Recall  Recall NG  Recall common  Recall CmnNG
HT abstract    1650     1357            0.0072  0.0096     0.1209         0.1608
HT full text   1545     2719            0.0145  0.0172     0.027          0.0319
HT Abs + FT    2608     3501            0.0187  0.0231     0.032          0.0395
NHT abstract   550      530             0.0461  0.0543     0.3055         0.3597
NHT full text  526      937             0.0815  0.0915     0.235          0.2639
NHT Abs + FT   841      1259            0.109   0.1243     0.2538         0.2895

Group       PMIDs  Count    Average mutation  SD      Mutation recall
COSMIC      7898   198 864  25.18             521.27  100.00%
COSMIC-HT   6266   187 367  29.9              584.82  94.22%
COSMIC-NHT  1632   11 497   7.04              38.05   5.78%
24. Considering tables and Supplementary
material
• Subset from COSMIC and InSiGHT available as
PubMed Central Open Access articles
• Supplementary material: MS Word, PDF, MS Excel,
PPT, images, …
                    InSiGHT                        COSMIC
Set                 Articles  Matched  Recall (%)  Articles  Matched  Recall (%)
Abstracts           13        1        0.4         563       140      0.41
XML Full Text (FT)  9         20       7.94        487       694      2.05
PDF FT (PDFFT)      4         7        2.78        76        23       0.07
Tables              8         18       7.14        394       466      1.38
FT+PDFFT+Tables     13        44       17.46       563       929      2.75
Supp. Mat.          1         88       34.92       138       17015    50.59
All                 13        115      45.63       563       17896    52.92
25. Recall still only 50%:
Where are the rest?
• Expressed in semi-structured data sources
– do not necessarily follow standard nomenclature
– data spread unpredictably across columns (Wong et al.
2009)
• Different reference position in text than database
– curator correction or normalized to different build
• Nomenclature variation
– c.482_483delGA vs c.482_483del2
• Linguistic expression of mutations
– deletion of exon 3
– C>T mutation at nucleotide 2131
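A tiny illustration of why nomenclature variation matters: the two equivalent deletion spellings above can be collapsed to one canonical form before matching (a toy normalisation, not the full HGVS specification):

```python
import re

def normalise_deletion(hgvs):
    """Collapse equivalent HGVS deletion spellings to one canonical form.

    "c.482_483delGA" (deleted bases spelled out) and "c.482_483del2"
    (deletion length given) describe the same event; both are reduced
    to "c.482_483del". Toy example only.
    """
    return re.sub(r"(del)([ACGT]+|\d+)$", r"\1", hgvs)
```

With this in place, `normalise_deletion("c.482_483delGA")` and `normalise_deletion("c.482_483del2")` both yield `"c.482_483del"`, so the two spellings match.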
26. Information in tables (spreadsheets, etc.)
is expressed differently than in narrative text
28. Text mining over semi-structured data?
• Access ?
• Variability (!)
– File formats
– How connected to the main text?
• Semantics (?!)
– How to make sense of the data?
– How to map to standardized nomenclature?
… processing supplementary material will require new
strategies. Some technical solutions. Some research.
29. Extraction of gene-disease-
mutation relations
Study:
Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for
the Human Variome. BMC Medical Informatics and Decision Making.
30. Variant interpretation using literature
• Evidence of prior significance of variants
• Evidence of established connection of the variant
to specific patient cohorts
• Use alone or in combination with other evidence
• We aim to extract the relations that connect genes,
diseases and mutations
• Specific Objective of the work:
relation extraction over the
Variome Corpus
31. gene-mutation-disease-phenotype relations
• Variome Annotation Schema
– a schema defining entities and relations of interest
to curation of genetic variants
• Variome Corpus
– A corpus of full text articles annotated according to
the Variome Annotation Schema
– To be used as training and evaluation data for text
mining tools for extracting genetic variation
information from the published literature
http://www.opennicta.com.au/home/health/variome
32. The Variome Corpus
10 full-text publications related to colorectal cancer
Entities                   Relations
Gene                       Gene-has-Mutation
Mutation                   Cohort/Patient-has-Mutation
Disease                    Mutation-relatedto-Disease
Body part                  Disease-relatedto-Gene
Cohort/Patient             Disease-relatedto-BodyPart
Size                       Mutation-has-Size
Age                        Cohort/Patient-has-Age
Gender                     Cohort/Patient-has-Gender
Ethnicity or Geo Location  Cohort/Patient-has-EthnicityLoc
Characteristic             Cohort/Patient-has-Disease
                           Cohort-has-Size
Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP.
(2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of
Biological Databases and Curation, bat019.
§ 43k words
§ Double-annotated
§ IAA varies: .88–.92 F for entities; relations much lower, reconciled manually
34. • Recognise genetic variants
• Named entity recognition for gene names
– Supervised learning for recognizing characteristics and contexts
– Combined with dictionaries to support normalisation
• Associating variants to genes
– Simple co-occurrence
– Combined with sequence verification
– Machine learning for relation classification (PKDE4J)
Extraction of mutation relations from text
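The "simple co-occurrence" association step can be sketched as nearest-mention assignment (an illustrative assumption: systems also use sentence- or abstract-level co-occurrence, and the function and variable names here are hypothetical):

```python
def associate_by_cooccurrence(genes, mutations):
    """Assign each mutation mention to the nearest gene mention.

    genes and mutations are lists of (name, char_offset) pairs from one
    document; each mutation is paired with the gene whose mention is
    closest in character distance.
    """
    pairs = []
    for mut, m_off in mutations:
        if genes:
            gene, _ = min(genes, key=lambda g: abs(g[1] - m_off))
            pairs.append((gene, mut))
    return pairs
```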
35. Information Extraction, Structuring text
From:
A subset of colorectal tumour DNA samples from 17
patients carrying the p.Lys618Ala variant …
To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7
(colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7
(patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge
(patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62
(patients has p.Lys618Ala)
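Standoff annotations of this kind can be read back into structured form with a few lines of code (a sketch assuming whitespace-separated fields; the actual brat standoff format uses tab-separated fields with richer conventions):

```python
def parse_standoff(lines):
    """Parse brat-style standoff lines into entities and relations.

    Entity lines look like:   "T60 body-part 1307 1317 colorectal"
    Relation lines look like: "R17_m relatedTo Arg1:T60 Arg2:T7"
    """
    entities, relations = {}, []
    for line in lines:
        parts = line.split()
        if parts[0].startswith("T"):
            # id, type, start offset, end offset, covered text
            entities[parts[0]] = (
                parts[1], int(parts[2]), int(parts[3]), " ".join(parts[4:])
            )
        elif parts[0].startswith("R"):
            # id, relation type, Arg1:<id>, Arg2:<id>
            arg1 = parts[2].split(":")[1]
            arg2 = parts[3].split(":")[1]
            relations.append((parts[1], arg1, arg2))
    return entities, relations
```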
36. PKDE4J: Yonsei University IE system
• PKDE4J
– Extensible, flexible text mining system for public knowledge
discovery
– Entity and relation extraction from unstructured text data
– Extension of Stanford CoreNLP (Manning et al., 2014)
– http://informatics.yonsei.ac.kr/pkde4j
• Differentiation of PKDE4J
– Configurable system
• Dictionary based entity extraction
• Extensible system
• Supports a wide range of relation extraction tasks via an
extensible rule engine based on dependency parsing
– Accurate performance
• PKDE4J outperforms many other competing algorithms for
both entity and relation extraction
37. PKDE4J: Yonsei University IE system
• PKDE4J’s major two pipelines
– Entity Extraction: Target entities based on
dictionaries by extending Stanford CoreNLP
– Relation Extraction: relationships among entities
based on dependency tree based rules
41. PKDE4J – Relation Extraction
• Based on dependency-parse (grammatical structure) rules
• To extract a relation
Step 1: Identify the verbs in a sentence
Category  Number of Verbs  Type        Verb Example
Positive  68               Increase    Lead, Contribute, Rise
                           Transmit    Shift, Move, Migrate
                           Substitute  Supplement, Alter
Negative  54               Decrease    Decline, Diffuse, Down-regulate
                           Remove      Deplete, Abrogate, Disassociate
Neutral   111              Contain     Possess, Constitute, Include
                           Modify      Methylate, Modulate, Normalize
                           Method      Bleach, Centrifuge, Spin
                           Report      Evaluate, Analyze, Examine
Plain     165              Plain       Return, Switch, Balance
42. PKDE4J - RE
Step 2: Check the structure of the sentence
• Syntactic rules based on deep parsing
– The dependency tree encodes grammatical relations
between words in a sentence.
– The tree denotes syntactic dependencies between two entities.
– We need to spot the portion of the parse tree that is
pertinent to the location of the entities in the sentence.
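Spotting the pertinent portion of the parse tree typically means finding the shortest dependency path between the two entity tokens. A sketch using breadth-first search over a hand-written toy edge list; a real system would take these edges from a dependency parser:

```python
from collections import deque

def shortest_dep_path(edges, start, goal):
    """BFS over undirected dependency edges; returns the token path or None."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hand-written (head, dependent) edges for
# "patients carrying the p.Lys618Ala variant"
edges = [("carrying", "patients"), ("carrying", "variant"),
         ("variant", "p.Lys618Ala"), ("variant", "the")]
path = shortest_dep_path(edges, "patients", "p.Lys618Ala")
# → ["patients", "carrying", "variant", "p.Lys618Ala"]
```

Note the verb "carrying" lies on the path between the two entities, the situation targeted by strategy ① on the next slide.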
43. PKDE4J - RE
• Rule Extraction
– Use Strategy design pattern
– Capture predefined rules (17 strategies)
①Verb in dependency path
②No verb in dependency path
③Detect nominalization
④Weak nominalization
⑤Negation
⑥Tense (active / passive)
⑦Contain clause
⑧Clause distance
⑨Negation clause
⑩Number intervening entities
⑪Entities in between
⑫Surface distance
⑬Entity counts
⑭Same head
⑮Entity order
⑯Full tree path
⑰Path length
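The Strategy design pattern mentioned above can be sketched as interchangeable rule objects checked against a relation candidate. The two rules below loosely correspond to strategies ① (verb in dependency path) and ⑤ (negation); the class names and candidate data layout are illustrative, not PKDE4J's actual API:

```python
class VerbInPathRule:
    """Strategy ① (sketch): fire when a known verb lies on the dependency path."""
    def applies(self, candidate):
        return any(tok in candidate["verbs"] for tok in candidate["dep_path"])

class NegationRule:
    """Strategy ⑤ (sketch): veto when a negation cue lies on the dependency path."""
    def applies(self, candidate):
        return not any(tok in ("not", "no", "never") for tok in candidate["dep_path"])

def extract(candidate, strategies):
    """Keep a candidate relation only if every configured strategy accepts it."""
    return all(rule.applies(candidate) for rule in strategies)

candidate = {"dep_path": ["patients", "carrying", "variant"],
             "verbs": {"carrying"}}
```

The point of the pattern is that the 17 strategies stay independent and can be composed per relation type without touching the engine.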
44. Evaluation:
PKDE4J over Variome Corpus
• Experimental set-up
– Data split
– Features?
– 10-fold cross-validation
• Focus on relations:
Used gold standard entities
• Baseline co-occurrence system
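The co-occurrence baseline predicts a relation for every pair of gold entities that share a sentence, which gives high recall but low precision. A minimal sketch; the entity tuples are illustrative:

```python
from itertools import combinations

def cooccurrence_relations(sentences):
    """sentences: list of per-sentence lists of (entity_text, entity_type) gold annotations."""
    predicted = []
    for ents in sentences:
        # every co-occurring entity pair is predicted as related
        for a, b in combinations(ents, 2):
            predicted.append((a, b))
    return predicted

sents = [[("p.Lys618Ala", "mutation"), ("tumour", "disease"),
          ("patients", "cohort-patient")],
         [("MLH1", "gene")]]
preds = cooccurrence_relations(sents)
```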
45. Results of the evaluation
Relation Extraction results for relations with at least 100
examples in the corpus.
46. Observations
• By applying text mining we can transform the
literature from an unstructured, difficult-to-use
resource into a structured resource.
• We can build systems that can recognise core
biological entities in the published literature.
• With this, the information is more accessible
– Formalised and normalised in a database
– Directly queryable
• and can be used to facilitate more computation:
– Information retrieval in terms of entities
– Predictive modeling and hypothesis generation
47. Conclusions
• Variants are relatively easy to recognise in the
literature when the recommended
nomenclature is followed (so please use it!).
• The relations between variants and other
entities are harder to extract, but we can still
do a reasonable job.
• There is a lot of information in ancillary
files associated with the literature (with some
challenges for automated systems).
The literature can be effectively mined to identify
variant-related information to assist biocuration
and clinical interpretation of variants.