Text mining the relations between chemicals and proteins is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect chemical-protein relations in running text (PubMed abstracts). This manuscript describes our submission, an ensemble of three systems: a Support Vector Machine, a Convolutional Neural Network, and a Recurrent Neural Network. Their outputs are combined using a decision based on majority voting or stacking. Our CHEMPROT system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410, demonstrating the effectiveness of machine learning-based approaches for automatic relation extraction from biomedical literature. Our submission achieved the highest performance in the task during the 2017 challenge.
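To make the ensemble step and the reported score concrete, the following is a minimal Python sketch of majority voting over three classifiers and of the F-score computed from precision and recall. It is not the authors' implementation: the label strings, the `fallback` rule for ties, and the function names `majority_vote` and `f_score` are illustrative assumptions, and the stacking variant is not shown.

```python
from collections import Counter


def majority_vote(predictions, fallback="none"):
    """Return the label predicted by a strict majority of systems.

    `fallback` (an assumption, not taken from the paper) is returned
    when no label is chosen by more than half of the systems.
    """
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else fallback


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-candidate predictions from the three systems.
svm = ["activator", "none", "inhibitor"]
cnn = ["activator", "substrate", "inhibitor"]
rnn = ["inhibitor", "none", "inhibitor"]

ensemble = [majority_vote(votes) for votes in zip(svm, cnn, rnn)]
print(ensemble)

# Check the reported figures: P = 0.7266, R = 0.5735.
print(round(f_score(0.7266, 0.5735), 4))
```

With the precision and recall quoted above, `f_score` gives approximately 0.6410, which agrees with the reported F-score.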
Network pharmacology: From BioAssay Response Data to Network (Bin Chen)
Network pharmacology: a comparison of protein pharmacology networks built with a ligand-based approach (Similarity Ensemble Approach) against those built from BioAssay response data.
Genome-wide association studies (GWAS) have been providing valuable insight into the genetics of common and complex diseases for many years. In this webcast we will walk through one possible workflow for completing a GWAS in Golden Helix SNP & Variation Suite (SVS), with special attention paid to adjusting the analysis for population stratification.
Back to Basics: Using GWAS to Drive Discovery for Complex Diseases (Golden Helix Inc)
Genome-wide association studies (GWAS) have been providing valuable insight into the genetics of common and complex diseases for nearly 10 years. Despite some assertions to the contrary, GWAS is not dead. GWAS is alive and well, and remains a viable technology for genetic discovery.
This webcast will cover:
GWAS data formats, usability, and data management techniques.
Imputation: Myths, facts, and when to use it.
Quality assurance: What questions should you be asking about your data?
Genotype association testing and statistics: Contingency tables, linear and logistic regression, Mixed Linear Models, and more.
Visualizations including Manhattan Plots, linkage disequilibrium plots, and genomic annotation sources.
Exploring public databases to investigate your results.
Tips for using exome chips and other targeted genotyping platforms.
Along the way, Dr. Christensen will highlight best practice approaches and common pitfalls to avoid. Golden Helix SNP & Variation Suite (SVS) software will be used to demonstrate many of these concepts.
Austin Neurology & Neurosciences is an open access, peer-reviewed scholarly journal dedicated to publishing articles covering all areas of Neurology & Neurological Sciences.
The journal aims to promote research communications and provide a forum for doctors, researchers, physicians and healthcare professionals to find the most recent advances in all areas of Neurology & Neurological Sciences. Austin Neurology & Neurosciences accepts original research articles, reviews, mini reviews, case reports and rapid communications covering all aspects of neurology & neurosciences.
Austin Neurology & Neurosciences strongly supports scientific advancement in the related research community by enhancing access to peer-reviewed scientific literature. Austin Publishing Group also brings universally peer-reviewed journals under one roof, thereby promoting knowledge sharing and the mutual promotion of multidisciplinary science.
This document provides an overview of genome-wide association studies (GWAS). It discusses the basic concept of GWAS, running and analyzing a GWAS, and interpreting the results. Key points include: GWAS genotype individuals for hundreds of thousands to millions of SNPs to look for associations with traits; extensive quality control is required; imputation can increase SNP coverage; statistical analysis includes computing p-values and correcting for multiple testing; significant findings still require replication in independent samples.
Genome-wide association mapping identifies genomic regions associated with phenotypes by analyzing phenotypic and genotypic data. Phenotypic data includes traits like flowering time and yield, while genotypic data consists of genetic markers spanning the genome. Single nucleotide polymorphisms (SNPs) are commonly used markers. Association mapping fits statistical models to test for association between each SNP and the phenotype. Accounting for population structure and relatedness through mixed models reduces false positives. Significant associations between SNPs and traits suggest the SNP directly affects the trait or is linked to a causal variant. Results are visualized through Manhattan plots and QQ-plots.
This document provides information about different sequencing platforms and their characteristics. It compares Illumina, PacBio and Oxford Nanopore platforms in terms of average read length, advantages, limitations and recommended material. It also provides a comparison table of long-range sequencing platforms including their throughput per run, number of human genomes per run and cost per genome.
The document describes Cignal Reporter Assays from SABiosciences that enable simple and robust analysis of signal transduction pathways. The assays utilize dual-luciferase reporters containing optimized transcriptional regulatory elements and luciferase variants to provide high sensitivity and low variability. The assays allow monitoring of 29 pathways and are available in different formats for various cell types and applications like RNAi and small molecule screening.
Next-generation sequencing has enabled clinicians and researchers alike to identify novel genetic variants associated with rare Mendelian diseases across the human genome. To help researchers and clinicians understand the role of CNVs in human health and disease, Golden Helix has integrated a specialized NGS-based CNV caller capable of detecting deletion and duplication events as small as single exons and as large as whole-chromosome aneuploidy events. In this webcast, we will present our workflows that integrate the NGS-based CNV caller into SVS.
Deep learning for extracting protein-protein interactions from biomedical lit... (Yifan Peng)
The document presents a method called McDepCNN for extracting protein-protein interactions from biomedical literature using a multichannel dependency-based convolutional neural network. McDepCNN incorporates both automatically learned features from different CNN layers and manually crafted features using domain knowledge. It outperforms traditional machine learning and current deep learning models on two benchmark datasets, and generalizes better across different datasets than other methods. The model achieves its best performance using word embeddings, part-of-speech tags, named entities, dependency labels, and position features as input channels, and applying convolution with window sizes of 3, 5, and 7.
The document summarizes research presented at the CNCP 2010 conference. It describes work in several areas of proteomics research including fragmentation analysis, labeling strategies, de novo sequencing, identification, label-free quantification, database construction, data quality control, data processing platforms, glycoproteomics, and proteogenomics. Specific projects are mentioned that improved peptide identification, developed in vivo termini amino acid labeling, performed de novo sequencing of peptides from unknown genomes, optimized peptide mass fingerprinting for protein mixtures, detected post-translational modifications, and more.
Discover new case studies giving you unprecedented access to both the data and results of how RNA-Seq is being applied successfully from bench to bedside
Gain new insights into RNA-Seq for the study of toxicity, IO, host-viral interactions and more from companies such as BMS, Janssen, Pfizer, Merck, UCSC and Stanford
Novogene is a Chinese genomics company founded in 2011 that now has over 1000 employees. It provides high-quality next-generation sequencing and bioinformatics services for research and clinical markets. Novogene has the largest sequencing capacity in China and is preparing for an IPO. It focuses on human whole genome, exome, and transcriptome sequencing, as well as microbial, plant, and animal sequencing and analysis. Novogene has successfully completed over 3000 customer projects utilizing its high performance computing capabilities.
1) Discovery: Over 1 million structural variant calls were discovered across 30 sequence-resolved callsets from 4 technologies for an AJ Trio. After clustering, over 128,000 sequence-resolved calls remained.
2) Discovery Support: Over 30,000 structural variants had support from 2+ technologies or 5+ callers in the trio.
3) Evaluate/genotype: Nearly 20,000 structural variants had a consensus variant genotype predicted for the son from analyzing the trio.
1) Discovery of over 1 million structural variant calls from 30+ sequence-resolved callsets across 4 technologies for an AJ Trio. 2) After clustering, over 128,000 sequence-resolved SV calls >=50bp remained. 3) Over 30,000 SVs had support from 2+ technologies or 5+ callers with sequences <20% different or support from optical mapping.
Predictive Analytics of Cell Types Using Single Cell Gene Expression Profiles (Ali Al Hamadani)
Built a domain-independent predictive analysis pipeline in R for cell type prediction. Applied many predictive analytics models and machine learning techniques.
This document summarizes the Genome in a Bottle (GIAB) project, which develops reference materials and benchmarks for evaluating human genome sequencing and variant detection. GIAB has characterized 7 human genomes to high accuracy using diverse sequencing technologies. It provides extensive public sequencing data for benchmarking along with well-characterized variants. GIAB aims to improve benchmarks for difficult variants using linked reads, long reads, and diploid genome assemblies. The project collaborates widely and its reference materials and data are openly available to support innovation in genome sequencing and analysis.
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
Molecular Subtyping of Breast Cancer and Somatic Mutation Discovery Using DNA... (Setia Pramana)
Molecular Subtyping of Breast Cancer and Somatic Mutation Discovery Using DNA and RNA sequence
Guest Lecture at Computer Science Department, IPB, Bogor
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C... (Thermo Fisher Scientific)
The document summarizes an analytical validation of the Oncomine Comprehensive Assay v3 (OCAv3) targeted next-generation sequencing panel performed in a CLIA-certified laboratory. The validation assessed analytical sensitivity, specificity, accuracy, and precision using formalin-fixed paraffin-embedded tumor samples and cell lines. Results showed the assay met performance thresholds of 90% or higher for detecting single nucleotide variants, insertions/deletions, copy number variants, and gene fusions across a wide range of variants. Over 2,500 clinical samples were subsequently sequenced with the assay maintaining a 95% success rate and average turnaround time of 10 days.
STRING - Prediction of a functional association network for the yeast mitocho... (Lars Juhl Jensen)
The document discusses predicting functional associations between proteins in the yeast mitochondrial system using the STRING database. It summarizes how STRING integrates genomic context, experimental data, and evidence from other species to infer functional links. It then describes applying these methods to predict mitochondrial proteins in yeast and build an association network for the yeast mitochondrial system, identifying functional modules within it.
Pooja Patel is seeking a laboratory technician position to further her experience in the molecular diagnostic field. She has a Bachelor of Science in Biochemistry from the University of Texas at Austin and a Bachelor of Science in Molecular Genetic Technology from the University of Texas at MD Anderson. Patel has over 3 years of laboratory experience in DNA extraction, PCR, sequencing, and bioinformatics. She is proficient in molecular biology techniques and is seeking to expand her skills in a full-time laboratory role.
The document describes a microarray study to analyze gene expression in atherosclerotic plaques and correlate it with factors related to plaque vulnerability. Specimens will be obtained from human carotid/coronary arteries and atherosclerotic plaques in mouse models. Gene expression will be profiled using microarrays and correlated with histopathology, pH, temperature, spectroscopy and other variables. The goal is to identify genes associated with vulnerable plaques and rupture. Plaques from influenza-infected and drug-treated mice will also be analyzed to study effects on gene expression and plaque structure.
132 gene expression in atherosclerotic plaques (SHAPE Society)
This document discusses microarray studies to analyze gene expression in atherosclerotic plaques and correlate it with factors related to plaque vulnerability. It begins with background on the history and applications of DNA microarrays. Key steps discussed include probe design, sample preparation including tissue collection, labeling RNA samples, hybridizing samples to a microarray chip, scanning and analyzing image data. The document outlines creating a custom microarray based on selected genes and correlating gene expression with temperature, pH, spectroscopy and histopathology of plaques. It will also analyze gene expression in influenza-infected mice and mice where plaques are induced to rupture with drugs. Human carotid artery specimens from surgery will be analyzed from symptomatic and asymptomatic patients.
The document describes a microarray study to analyze gene expression in atherosclerotic plaques and correlate it with factors related to plaque vulnerability. Specimens will be obtained from human carotid/coronary arteries and atherosclerotic plaques in mouse models. Gene expression will be profiled using microarrays and correlated with histopathology, pH, temperature, spectroscopy and other variables. Plaques from influenza-infected and drug-treated mice will also be analyzed to identify genes associated with plaque rupture. The goal is to better understand plaque vulnerability and identify potential drug targets.
A unified database of structure/activity data is presented. This database was used to derive activity / classification models with Bayesian statistics and Linear Discriminant Analysis. This work has been published: http://www.nature.com/nbt/journal/v24/n7/abs/nbt1228.html
Phenomics assisted breeding in crop improvement (IshaGoswami9)
The world population is increasing and is expected to reach about 9 billion by 2050, and climate change makes it harder to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics of multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Similar to Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...AbdullaAlAsif1
The pygmy halfbeak Dermogenys colletei, is known for its viviparous nature, this presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the Pygmy Halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the Pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study lends to a better understanding of viviparous fish in Borneo and contributes to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
1. Yifan Peng, Anthony Rios, Ramakanth Kavuluru, Zhiyong Lu Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models 1
Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
Yifan Peng¹, Anthony Rios¹,², Ramakanth Kavuluru²,³, Zhiyong Lu¹
¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
² Department of Computer Science, University of Kentucky
³ Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky
2. Outline
• Individual models
  • SVM
  • CNN
  • RNN
• Ensembles of three models
  • Majority voting
  • Stacking
• Experiments
  • 5-fold cross validation on training + dev set
  • Results on test set
3. Chemical-protein relations
• A multiclass classification problem
• The chemical-protein relations occurring in a single sentence
4. SVM with rich feature vector
SVM
• Linear kernel
• One-vs-rest scheme
Rich feature vector
• Words/part-of-speech tags surrounding the chemical and gene mentions
• Bag-of-words between the chemical and gene mentions
• Distance between two entity mentions
• Shortest path in a dependency graph
Miwa, M.; Sætre, R.; Miyao, Y. & Tsujii, J. A rich feature vector for protein-protein interaction extraction from multiple corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, 1, 121-130
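Three of the feature groups on this slide (surrounding words, bag-of-words between the mentions, and mention distance) can be illustrated with a minimal sketch. It assumes a pre-tokenized sentence in which each mention is a single token; the function name `extract_features`, the feature-key format, the window size of 2, and the example sentence are all hypothetical and not taken from the authors' system. Part-of-speech tags and dependency-path features are left out.

```python
from collections import Counter

def extract_features(tokens, chem_idx, gene_idx, window=2):
    """Build a sparse feature dict for one chemical-gene candidate pair.

    tokens   : list of word strings for the sentence
    chem_idx : token index of the chemical mention (assumed single token)
    gene_idx : token index of the gene mention (assumed single token)
    window   : surrounding words per side (hypothetical default)
    """
    feats = Counter()
    left, right = sorted((chem_idx, gene_idx))

    # Words surrounding each mention, keyed by entity and relative offset.
    for name, idx in (("chem", chem_idx), ("gene", gene_idx)):
        for off in range(-window, window + 1):
            pos = idx + off
            if off != 0 and 0 <= pos < len(tokens):
                feats[f"{name}_ctx[{off}]={tokens[pos].lower()}"] = 1

    # Bag-of-words between the two mentions (word order is ignored).
    for tok in tokens[left + 1:right]:
        feats[f"between={tok.lower()}"] += 1

    # Distance between the two mentions, counted in tokens.
    feats["distance"] = right - left
    return dict(feats)

# Hypothetical example sentence with the mentions already masked.
sentence = "CHEMICAL inhibits the induction of GENE".split()
features = extract_features(sentence, chem_idx=0, gene_idx=5)
```

A dictionary of this shape could be vectorized and passed to any linear classifier; how the authors actually encoded their features is not described on the slide.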
5. Shortest path in a dependency graph
• Obtained using Bllip parser + Stanford dependencies converter
• Vertex walks
  • CHEMICAL – nsubj – inhibits
  • inhibits – dobj – induction
  • induction – nmod:of – GENE
• Edge walks
  • nsubj – inhibits – dobj
  • dobj – induction – nmod:of
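The vertex and edge walks listed on this slide can be derived mechanically from the shortest path. The sketch below assumes the path is already available as an alternating list of words and dependency labels; the function names `vertex_walks` and `edge_walks` are hypothetical, and edge direction is ignored for simplicity.

```python
def vertex_walks(path):
    """Return (vertex, edge, vertex) triples from an alternating path.

    `path` is assumed to look like [v0, e01, v1, e12, v2, ...],
    starting and ending with a vertex.
    """
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]

def edge_walks(path):
    """Return (edge, vertex, edge) triples from the same alternating path."""
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 3, 2)]

# The shortest path implied by the walks shown on the slide.
path = ["CHEMICAL", "nsubj", "inhibits", "dobj", "induction", "nmod:of", "GENE"]
v_walks = vertex_walks(path)
e_walks = edge_walks(path)
```

For this path, `v_walks` contains the three vertex walks and `e_walks` the two edge walks shown on the slide.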
6. Convolutional Neural Network
Peng, Y. & Lu, Z. Deep learning for extracting protein-protein interactions from biomedical literature. Proceedings of BioNLP 2017, 2017, 29-38
7. Convolutional Neural Network
• Word embedding: 300 dimensions
  • Trained on PubMed using word2vec
• Part-of-speech, chunk and named entities: one-hot encoding
  • Obtained using Genia Tagger
• Convolutional window size: 3 and 5
• Filters: 300
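The input encoding on this slide combines a dense word embedding with one-hot tag vectors. The sketch below shows that concatenation for a single word under simplifying assumptions: the embedding is a 2-dimensional toy vector rather than a 300-dimensional word2vec vector, the tag vocabularies are made up, and the names `one_hot` and `word_representation` are hypothetical.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list with a 1.0 at the position of `value`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def word_representation(embedding, pos, chunk, pos_vocab, chunk_vocab):
    """Concatenate a word embedding with one-hot POS and chunk vectors."""
    return list(embedding) + one_hot(pos, pos_vocab) + one_hot(chunk, chunk_vocab)

# Toy vocabularies; a real tagger produces many more labels.
POS_VOCAB = ["NN", "VBZ"]
CHUNK_VOCAB = ["B-NP", "B-VP", "O"]

vec = word_representation([0.1, 0.2], "VBZ", "B-VP", POS_VOCAB, CHUNK_VOCAB)
```

Stacking one such vector per word yields the sentence matrix over which the convolution filters slide.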
8. Recurrent Neural Network
Kavuluru, R.; Rios, A. & Tran, T. Extracting Drug-Drug Interactions with Word and Character-Level Recurrent Neural Networks. 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017, 5-12
9. Recurrent Neural Network
• Pairwise ranking loss
  • The output layer has 5 positive classes
  • If all 5 class scores are negative, then we predict the negative class
• Preprocessing
  • Replace words that occur fewer than 5 times with an UNK token
• Word embedding: 300 dimensions
  • Obtained from GloVe
Santos, C. N.; Xiang, B. & Zhou, B. Classifying Relations by Ranking with Convolutional Neural Networks. ACL, 2015, 626-634
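The rare-word preprocessing step on this slide can be sketched as follows. The function name `replace_rare_words` is hypothetical, and the sketch assumes counts are case-sensitive and computed over the same tokenized corpus that is being rewritten; the slide does not say how the authors handled either point.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=5, unk="UNK"):
    """Replace tokens seen fewer than `min_count` times with `unk`.

    `sentences` is a list of token lists.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [tok if counts[tok] >= min_count else unk for tok in sent]
        for sent in sentences
    ]

# Toy corpus: "aspirin" occurs 5 times, "rare" only once.
corpus = [["aspirin", "inhibits", "cox"]] * 5 + [["rare", "inhibits", "cox"]]
cleaned = replace_rare_words(corpus)
```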
10. Ensembles of SVM, CNN, and RNN models
11. Ensembles of SVM, CNN, and RNN models
Majority voting
• Select the relations that are predicted by more than 2 models
Stacking
• Random Forest classifier
• 17 features:
  • 6 from SVM
  • 6 from CNN
  • 5 from RNN (pairwise ranking loss)
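A majority vote over three model outputs can be sketched as below. The vote threshold is kept as a parameter because the slide's wording ("more than 2 models") would, read literally, require all three models to agree, whereas a simple majority of three needs only two; the default of 2 here is an assumption. The label strings and the `NONE` placeholder for "no relation" are also hypothetical.

```python
from collections import Counter

NEGATIVE = "NONE"  # hypothetical label for "no relation"

def majority_vote(predictions, min_votes=2):
    """Return the relation label chosen by at least `min_votes` models.

    `predictions` holds one label per model. Negative predictions do not
    count as votes; if no positive label reaches the threshold, the
    candidate pair is labeled NEGATIVE.
    """
    votes = Counter(p for p in predictions if p != NEGATIVE)
    if votes:
        label, count = votes.most_common(1)[0]
        if count >= min_votes:
            return label
    return NEGATIVE

# Predictions from SVM, CNN, and RNN for one candidate pair.
decision = majority_vote(["CPR:4", "CPR:4", "CPR:3"])
```

Passing `min_votes=3` gives the stricter unanimous reading.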
12. Results for 5-fold cross validation
• Combine training and development sets
• 5-fold cross validation
  • 60% for training
  • 20% for development (also used to train the stacking systems)
  • 20% for test
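One way to obtain a 60/20/20 split in every round of 5-fold cross validation is to rotate the roles of the folds. The sketch below is an assumption about how this could be done, not the authors' procedure: the function name `rotating_splits` is hypothetical, the fold after the test fold is arbitrarily used for development, and no shuffling is applied.

```python
def rotating_splits(n_items, n_folds=5):
    """Yield (train, dev, test) index lists, one triple per fold.

    In round k, fold k is the test set, the next fold is the development
    set, and the remaining folds form the training set.
    """
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for i in fold]
        yield train, folds[dev_k], folds[k]

splits = list(rotating_splits(10))
```

With five folds, each round trains on three folds (60%), tunes and trains the stacking classifier on one (20%), and evaluates on one (20%).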
13. Results of 5-fold cross validation
| Model | P | R | F |
|---|---|---|---|
| SVM | 0.629 | 0.478 | 0.543 |
| CNN | 0.641 | 0.571 | 0.602 |
| RNN | 0.608 | 0.614 | 0.609 |
| Majority voting | 0.741 | 0.552 | 0.632 |
| Stacking | 0.755 | 0.552 | 0.638 |
14. Results on test set
| Run | System | P | R | F |
|---|---|---|---|---|
| 1 | Majority voting | 0.7437 | 0.5529 | 0.6343 |
| 2 | Majority voting | 0.7283 | 0.5503 | 0.6269 |
| 3 | Stacking | 0.7426 | 0.5382 | 0.6241 |
| 4 | Stacking | 0.7311 | 0.5685 | 0.6397 |
| 5 | Stacking | 0.7266 | 0.5735 | 0.6410 |
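The F column in the test-set results is consistent with the balanced F-score, the harmonic mean of precision and recall. A short check, using a hypothetical helper named `f_score` and the reported values for runs 1 and 5:

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

run1 = f_score(0.7437, 0.5529)  # reported F: 0.6343
run5 = f_score(0.7266, 0.5735)  # reported F: 0.6410
```

Small differences in the last digit can appear for other runs because the reported precision and recall are themselves rounded.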
15. Results on test set
| Evaluation | System | P | R | F |
|---|---|---|---|---|
| 5-fold CV | Majority voting | 0.7408 | 0.5517 | 0.6319 |
| 5-fold CV | Stacking | 0.7554 | 0.5524 | 0.6378 |
| Testing | Majority voting | 0.7437 | 0.5529 | 0.6343 |
| Testing | Stacking | 0.7266 | 0.5735 | 0.6410 |
16. Summary and future work
Summary
• Ensemble systems of three models: SVM, CNN, and RNN
• Results are consistent on training + development set and on the test set
• Ensemble methods improved precision
• Performance of CNN and RNN is comparable
Future work
• Error analysis
• Fair comparisons between CNN and RNN
• Effects of different parts of deep learning models
17. Acknowledgement
• The organizers of the BioCreative VI CHEMPROT task
• Members
  • Yifan Peng, NCBI
  • Anthony Rios, Department of Computer Science, University of Kentucky
  • Ramakanth Kavuluru, Department of Internal Medicine, University of Kentucky
  • Zhiyong Lu, NCBI
18. Thank You!
yifan.peng@nih.gov
Editor's Notes
I am Yifan Peng from the bio text mining group at NCBI. Our group participated in the chemical protein relation extraction task.
Luckily, our submissions achieved the highest performance in the task during the 2017 challenge
I am glad to have a chance to report on our ensemble system and share some observations.
This is the architecture of our system.
I will start with the three individual models we built for this task: SVM, CNN, and RNN.
Then I will describe how we built the ensembles of these three models.
Finally, I will present the results of 5-fold CV on the training and development sets.
We used 5-fold cross validation to reduce the variance due to any single split of the dataset and to better understand how these models perform on the data.
We built the SVM with a rich feature vector. Here we use a linear kernel and a one-vs-one scheme. The features were selected based on Miwa’s work in 2009.
A walk is defined as an alternating sequence of vertices and edges, $v_i, e_{i,i+1}, v_{i+1}, e_{i+1,i+2}, \ldots, v_{i+n-1}$, beginning with a vertex and ending with a vertex. The length of a walk is the number of edges that it uses. We take into consideration walks of length 1.
An e-walk starts and ends with an edge (e.g. $e_{i,i+1}, v_{i+1}, e_{i+1,i+2}$). It is not actually a walk as defined in graph theory, but an ad hoc concept that provides syntactic structure to the learning model, because the contexts given by syntactic relations are crucial.
There are two input layers. One is a matrix of the whole sentence. The other is a matrix of the shortest path between the two entities.
Each word in either a sentence or a shortest path is represented by concatenating the embeddings of the word, its part-of-speech tag, chunk, named entity, dependency, and positions relative to the two mentions of interest.
dimensionality
Max pooling across the hidden states of the LSTM.
We concatenate the two representations to obtain the final representation of each word.
To obtain a representation of the sentence, we use max-over-time (1-max) pooling across hidden state word representations.
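Max-over-time pooling keeps, for each hidden dimension, the largest value seen across all word positions. A minimal sketch with a hypothetical function name `max_over_time` and toy 3-dimensional hidden states standing in for the LSTM outputs:

```python
def max_over_time(hidden_states):
    """Collapse per-word vectors into one sentence vector.

    `hidden_states` is a list of equal-length vectors, one per word.
    The result holds the maximum of each dimension over all words.
    """
    return [max(dim) for dim in zip(*hidden_states)]

# Toy hidden states for a three-word sentence.
states = [
    [0.1, -0.4, 0.3],
    [0.7, 0.2, -0.1],
    [-0.2, 0.5, 0.0],
]
sentence_vec = max_over_time(states)
```

The output has the same length as one hidden state, regardless of how many words the sentence contains.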
The output layer has only 5 classes; we completely discard the negative class. Specifically, we use the pairwise ranking loss proposed by Santos et al. (16). At prediction time, if all 5 class scores are negative, then we predict the negative class. Otherwise, we predict the class with the largest positive score.
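The prediction rule and a ranking loss of this kind can be sketched as follows. The loss follows the general form given by dos Santos et al.: one term pushes the score of the correct class above a margin, and a second term pushes the highest-scoring incorrect class below a negative margin, with only the second term used for negative examples. The function names, the label strings, and the default values of `gamma`, `m_pos`, and `m_neg` are illustrative assumptions; the settings used by the authors are not stated here.

```python
import math

def predict(scores, labels, negative="NONE"):
    """Pick the class with the largest positive score, else `negative`."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best] if scores[best] > 0 else negative

def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss for one example.

    scores : list of scores, one per positive class
    gold   : index of the correct class, or None for a negative example
    """
    loss = 0.0
    if gold is not None:
        # Push the correct class score above the positive margin.
        loss += math.log1p(math.exp(gamma * (m_pos - scores[gold])))
    # Push the best-scoring incorrect class below the negative margin.
    wrong = [s for i, s in enumerate(scores) if i != gold]
    loss += math.log1p(math.exp(gamma * (m_neg + max(wrong))))
    return loss

LABELS = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]  # placeholder labels
no_relation = predict([-1.2, -0.3, -2.0, -0.8, -1.5], LABELS)
relation = predict([-1.2, 0.9, 0.4, -0.8, -1.5], LABELS)
```

Because no score is ever learned for the negative class, the model never needs to find a common pattern among unrelated pairs.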
There are many aspects we need to explore after BioCreative VI.
That’s how we built the three models. Now I will describe two ensemble methods to combine the results from the 3 individual models.
One way is by majority voting. If more than 2 models predict the same relation type, we pick it. Otherwise, the pair is labeled negative.
The other way is using stacking. We trained a random forest classifier using all the predictions of the three models.
For SVM: we use the distance of the samples X to the separating hyperplane.
For CNN, we use the unnormalized scores for each class before passing them through the softmax function
For RNN, we use the scores after softmax
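Assembling the stacking input amounts to concatenating the per-class outputs of the three models into one 17-dimensional vector per candidate pair (6 from the SVM, 6 from the CNN, 5 from the RNN). The sketch below only builds that vector; the function name `stacking_features` and the example numbers are hypothetical, and training the random forest on these vectors is not shown.

```python
def stacking_features(svm_distances, cnn_logits, rnn_scores):
    """Concatenate the three models' outputs into one feature vector.

    svm_distances : 6 signed distances to the separating hyperplanes
    cnn_logits    : 6 unnormalized class scores (before softmax)
    rnn_scores    : 5 scores for the positive classes
    """
    if (len(svm_distances), len(cnn_logits), len(rnn_scores)) != (6, 6, 5):
        raise ValueError("expected 6 SVM, 6 CNN, and 5 RNN values")
    return list(svm_distances) + list(cnn_logits) + list(rnn_scores)

# Made-up outputs for a single candidate pair.
row = stacking_features(
    [-1.1, 0.4, -0.9, -1.3, -0.7, 0.2],
    [0.3, 2.1, -0.5, -1.0, 0.0, 1.2],
    [0.05, 0.80, 0.05, 0.05, 0.05],
)
```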
To reduce variability, 5-fold cross-validation was performed using different partitions of the data.
We could probably improve the SVM if we spent more time on feature engineering.
Results of CNN and RNN are comparable on F-score
Using the ensemble system, with either majority voting or stacking, precision improves by about 10 percentage points over the best individual model (from 0.641 for the CNN to 0.741 and 0.755). As a result, the F-score also improves.
So the ensemble really pushed us over the top in final submissions
The results are quite consistent on the training+dev and on the test set.
The second highest precision is ~0.67.
The second highest f-score is 0.6141.
Results by 5-fold cross validation
The results are quite consistent on the training+dev and on the test set.
The models show no sign of overfitting.