This document describes the development of a method for extracting transcription factor-protein interactions from text. The work builds a pipeline that reads documents, recognizes annotations, generates features, trains a machine learning model, and evaluates its predictions. Key steps include dependency parsing; generating sentence-level, token-level, n-gram, and dependency features; and evaluating performance using precision, recall, and F-measure. The method achieves 69.1% precision, 69.9% recall, and 69.3% F-measure on the developed corpus.
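As a rough illustration of the evaluation step, the sketch below computes precision, recall, and F-measure over sets of extracted interaction pairs. The pair representation, the function name `precision_recall_f1`, and the toy identifiers are assumptions made for this example; they are not taken from the presentation, which does not say how predictions were matched against the gold annotations.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted interaction pairs against gold-standard pairs.

    Both arguments are iterables of hashable items, here hypothetical
    (transcription_factor, protein) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure (F1) is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: identifiers are placeholders, not real annotations.
gold_pairs = {("TF_A", "P1"), ("TF_A", "P2"), ("TF_B", "P3"), ("TF_C", "P4")}
predicted_pairs = {("TF_A", "P1"), ("TF_B", "P3"), ("TF_B", "P9"), ("TF_C", "P4")}

p, r, f = precision_recall_f1(gold_pairs, predicted_pairs)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring over sets means a pair predicted twice is counted once; a mention-level evaluation would need a different matching rule.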
The document provides information about various conferences and workshops including ACL-IJCNLP 2015 which had 173 long papers presented orally and 68 as posters. It also summarizes several research papers related to natural language processing including automatic prediction of drunk texting, modeling argument strength in student essays, driving ROVER with segment-based ASR quality estimation, and multi-level translation quality prediction with QUEST++. Finally, it mentions an unsupervised method for decomposing multi-author documents and identifying age-appropriate ratings of song lyrics from text.
HRGRN: enabling graph search and integrative analysis of Arabidopsis signalin...Araport
The biological networks controlling plant signal transduction, metabolism and gene regulation are composed not only of genes, RNA, proteins and compounds but also of the complicated interactions among them. Yet, even in the most thoroughly studied model plant, Arabidopsis thaliana, the knowledge regarding these interactions is scattered throughout the literature and various public databases. Thus, making new scientific discoveries by exploring these complex and heterogeneous data remains a challenging task for biologists.
We developed a graph-search-empowered platform named HRGRN to search known relationships and, more importantly, discover novel relationships among genes in Arabidopsis biological networks. HRGRN includes over 51,000 “nodes” that represent large sets of genes, proteins, small RNAs, and compounds, and approximately 150,000 “edges” that are classified into nine types of interactions (interactions between proteins, compounds and proteins, transcription factors (TFs) and their downstream target genes, small RNAs and their target genes, kinases and downstream target genes, transporters and substrates, substrate/product compounds and enzymes, as well as gene pairs with similar expression patterns to provide deep insight into gene-gene relationships) to comprehensively model and represent the complex interactions between nodes.
HRGRN allows users to discover novel interactions between genes and/or pathways, and to build sub-networks from user-specified seed nodes by searching the comprehensive collections of interactions stored in its back-end graph databases using graph traversal algorithms. The HRGRN database is freely available at http://plantgrn.noble.org/hrgrn/. Currently, we are collaborating with the Araport team to develop REST-like web services and provide HRGRN's graph search functions to the Araport system.
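The idea of building a sub-network from seed nodes by graph traversal can be sketched with a plain breadth-first search. This is only an illustration under stated assumptions: the function `build_subnetwork`, the hop limit, the choice to treat edges as undirected, and the toy node names are all hypothetical, and nothing here reflects HRGRN's actual back-end or query interface.

```python
from collections import deque


def build_subnetwork(edges, seeds, max_hops):
    """Collect nodes within max_hops of any seed, plus the edges among them.

    edges: list of (source, target, interaction_type) tuples.
    Edges are treated as undirected here, which is a simplifying assumption.
    """
    adjacency = {}
    for source, target, _kind in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)

    # Breadth-first search records the hop distance of each reached node.
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    # Keep only edges whose two endpoints were both reached.
    sub_edges = [e for e in edges if e[0] in distance and e[1] in distance]
    return set(distance), sub_edges


# Toy network: names and interaction labels are placeholders.
toy_edges = [
    ("TF1", "GeneA", "tf-target"),
    ("GeneA", "ProtB", "protein-protein"),
    ("ProtB", "CmpdC", "compound-protein"),
    ("miR1", "GeneZ", "smallRNA-target"),
]
nodes, sub_edges = build_subnetwork(toy_edges, ["TF1"], max_hops=2)
print(sorted(nodes))
```

A hop limit keeps the result small on a dense network; a real graph database would typically also let the caller filter by interaction type during traversal.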
Formal languages to map Genotype to Phenotype in Natural Genomesmadalladam
The document discusses using formal language theory to model genotype to phenotype (G2P) mappings. It proposes that G2P mappings are non-linear networks rather than linear pathways, and that formal languages could be used to formally represent these networks. Specifically, it suggests using concepts from computational linguistics like context-free grammars, attribute grammars, and semantic actions to parse genetic sequences and compute their phenotypic outcomes. As an example, it presents a context-free grammar for designing genetic constructs and computing their chemical dynamics using an attribute grammar. In summary, formal languages may provide a way to rigorously define the complex non-linear relationships between genotypes and resulting phenotypes.
Web services provide programmatic interfaces to online services and tools, allowing for machine-to-machine communication across the web. They have become widely used in the life sciences, with major providers like EMBL-EBI, DDBJ, and NCBI offering hundreds of services. However, ensuring the sustainability, usability, and reliability of web services remains an ongoing challenge. Catalogs aim to help users discover and understand available services, but require community involvement to maintain accurate and up-to-date information.
InterProScan is a tool that combines different protein signature recognition methods to identify distant relationships and infer protein function. It integrates predictive information from partner resources to classify proteins into families and identify their domains and sites. Users can submit novel nucleotide or protein sequences to InterProScan to scan them against the signatures in the InterPro database. Matches are output in various formats to functionally characterize the submitted sequences. The document then provides steps for using InterProScan to analyze protein sequences and view results that identify family membership and conserved sites.
The document describes the Transcriptome Analysis Console (TAC) software from Affymetrix, which provides powerful and intuitive tools for analyzing gene expression data from Affymetrix microarrays. The TAC software allows researchers to perform differential expression analysis, visualize gene pathways and networks, explore interactions between coding and non-coding RNA, and link results to public databases. It is designed to simplify the analysis and interpretation of complex gene expression data.
The document provides information about RNA-seq analysis using R and Bioconductor. It begins with an introduction to the BCBB branch and its services assisting researchers with bioinformatics and computational projects. The document then discusses RNA-seq, R, and Bioconductor individually before explaining how they can be used together for RNA-seq analysis. Step-by-step tutorials and resources are provided for differential expression analysis and other tasks using R packages like DESeq2.
Data Mining - Short Story Assignment (2).pptxvijithagunta1
The document summarizes a research paper that investigated using ChatGPT to generate synthetic clinical text for training models for biological named entity recognition and relation extraction tasks. The researchers found that generating synthetic data with ChatGPT and fine-tuning local models on this data significantly improved performance over both zero-shot ChatGPT and state-of-the-art models, while also addressing privacy concerns with real patient data. The paper demonstrates the potential of leveraging large language models to generate synthetic data for improving clinical text mining applications.
Next-generation sequencing is producing vast amounts of genomic data that are challenging to store, analyze, and make sense of biologically. The author developed a pipeline and website to map short reads from Micromonas samples to a reference genome, count mapped reads in exons, introns, and other regions, and visualize read mapping across chromosomes to begin addressing these challenges. Key steps included filtering reads, mapping with BWA, Bowtie and BFAST, and using BEDTools and other software to analyze mappings and produce figures for visualization.
This document proposes quality measures for assessing linkset quality in linked data. It defines quality indicators, scoring functions, and aggregate metrics for evaluating linksets. Quality indicators examine aspects like entity types and counts. Scoring functions measure type coverage, completeness, and entity coverage within linksets. Interpretation tables help users understand scoring results and determine next steps. The measures specifically address linkset completeness for complementing datasets. The work contributes a first formalization and prototype for linkset quality assessment.
Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
A Workshop at the Stowers Institute for Medical Research.
Cytoscape is an open source software platform used to visualize molecular interaction networks and integrate gene expression data. It was created in 2002 at the Institute for Systems Biology and has since been developed by an international consortium. Cytoscape can be used to analyze and visualize networks in biological research as well as in other domains involving nodes and edges. It can load, save, and analyze networks along with gene expression profiles and functional annotations to identify active subnetworks and generate hypotheses about regulatory interactions.
Presentation for NetBio SIG 2013 by Robin Haw, Scientific Associate and Outreach Coordinator, Ontario Institute for Cancer Research. “Reactome Knowledgebase and Functional Interaction (FI) Cytoscape Plugin”
Beyond Transparency: Success & Lessons From tambisBoston2003robertstevens65
TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) aims to provide a single query language, data model, and location for distributed biological information sources by creating the illusion of transparency. It does this through ontologies that provide a consistent shared understanding of metadata, and middleware that rewrites user queries against the ontology into coordinated multi-source requests. While the illusion of transparency is appealing, it requires significant effort to maintain and does not accommodate the changing nature of sources. The greatest outcomes were found to be the ontologies and knowledge representation techniques developed for the system.
This document summarizes a presentation given by Rafael Jimenez from the European Bioinformatics Institute (EBI) on the standards and tools used for molecular interaction data, including PSI-MI XML, PSI-MITAB, and PSICQUIC. It describes the formats for representing interaction data, tools for working with the data, data distribution through PSICQUIC, and databases that provide interaction data following these standards, such as IntAct.
The document provides information about biological databases and sequence identifiers. It discusses the main objectives of biological databases which include information systems, query systems, storage systems and data. It describes primary databases like GenBank, EMBL, DDBJ, UniProt and PDB as well as secondary curated databases like RefSeq, Taxon and OMIM. It also explains different types of sequence identifiers used in databases like LOCUS, ACCESSION, VERSION, gi numbers and protein identifiers.
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...StampedeCon
At the StampedeCon 2013 Big Data conference in St. Louis, Jeff Melching, Big Data Engineer and Architect at Monsanto, discussed Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Case Study. The bioinformatics domain, and computational genomics in particular, has always had the problem of computing analytics against very large data sets. Traditionally, these analytics have leveraged grid and compute farm technologies. Additionally, the analytics software and algorithms have been built up over the past 30 years by contributions from both the public and private domains and written in a number of programming languages. When these software packages are brought in-house and combined with the skills and preferences of internal bioinformatics researchers, what you get is a myriad of different technologies linked together in an analytics pipeline. The rise of technologies like MapReduce in Hadoop has made the execution of such pipelines much more efficient, but what about all those analytic pipelines I have built up over the years that aren’t written in MapReduce? Do I have to rewrite them? Do I have to know Java? This talk will explain how Hadoop Streaming can help you reuse instead of rewriting. It will also touch on techniques for packaging and deploying Hadoop applications without having to centrally manage software versions on the cluster.
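The reuse argument rests on the Hadoop Streaming contract: a mapper or reducer can be any executable that reads records from standard input and writes tab-separated key/value lines to standard output, with reducer input arriving sorted by key. The sketch below shows that contract in Python. The input layout (chromosome name in the first tab-separated column) and the task of counting records per chromosome are assumptions chosen for illustration; they are not taken from the talk.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'chromosome<TAB>1' for each tab-separated input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:  # skip blank lines
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum counts per key; assumes input is already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# When run as a script with a stage argument ("map" or "reduce"),
# stream stdin to stdout the way a streaming job would invoke it.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for output_line in stage(sys.stdin):
        print(output_line)
```

Because both stages are ordinary stdin/stdout filters, they can be checked locally by piping a file through the map stage, a sort, and the reduce stage before submitting anything to a cluster. A legacy tool written in another language fits the same contract as long as it reads and writes lines this way.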
This document provides an overview and introduction to RNA-seq analysis using Next Generation Sequencing. It discusses the RNA-seq workflow including mapping reads with TopHat2, transcript assembly with Cufflinks, and differential expression analysis. Key points covered include the advantages of RNA-seq over microarrays, the exponential drop in sequencing costs, mapping strategies for junction reads including TopHat, and running TopHat from the command line.
exRNA Data Analysis Tools in the Genboree Workbenchexrna
This document summarizes Genboree services for exRNA data analysis including the Genboree Workbench for data analysis tools, Genboree Commons for document sharing and discussion, and GenboreeKB for metadata tracking. It demonstrates using the exceRpt small RNA-seq analysis pipeline in the Genboree Workbench and provides an overview of the exRNA Atlas for browsing exRNA profiling studies and samples based on metadata.
Accelerate Pharmaceutical R&D with Big Data and MongoDBMongoDB
This document provides a summary of a presentation on using MongoDB and big data technologies to accelerate pharmaceutical research and development at AstraZeneca. The presentation discusses:
- AstraZeneca's focus on using next generation sequencing and big data to predict drug effectiveness and find associations between gene sequences and drug responses.
- Pilot projects using MongoDB to store and query unstructured genomic and clinical trial data at scale in a flexible document format.
- How these pilots helped prove the value of NoSQL databases for enabling faster exploration and analysis of large, complex datasets by researchers.
- Future visions for using experimental management systems and big data analytics to integrate multiple data types and power predictive analytics across AstraZeneca's drug development pipelines
The document discusses using R and Bioconductor for gene ontology (GO) term analysis. It describes GO terms, R, and Bioconductor. It then outlines several Bioconductor packages for working with GO terms, including GO.db for basic GO term data, topGO for gene enrichment analysis, goProfiles for statistical analysis of functional profiles, and GOSim for analyzing gene similarities based on GO terms. An example is given of comparing GO terms between two Arabidopsis chromosomes using these tools.
This document discusses several genes related to stem cell pluripotency, including OCT4, SOX2, NANOG, and LIN28. It provides information on the functions of these genes obtained from searches of PubMed, NCBI Gene, and other bioinformatics databases. Details include OCT4's role in maintaining pluripotency, SOX2's interaction with OCT4 and DNA binding structure, alignments of NANOG mRNA and protein sequences between human and mouse, and conserved domains identified in human and mouse LIN28 proteins through BLAST and CDD searches.
This document discusses several data integration tools: DAS, PSICQUIC, EnFIN, EnCORE, and BioMart. DAS is a distributed annotation system that allows uniform access to biological data from multiple repositories. PSICQUIC integrates molecular interaction data based on the PSI-MI standard. EnFIN, EnCORE, and EnVISION provide data integration across various domains, sources, formats and types by standardizing data in an EnXML format and developing web services. BioMart allows federated querying of biological data across different databases through a common query interface.
Nils Gehlenborg presented on visualizing 3D genome data. He discussed how Hi-C data captures chromosomal interactions to measure 3D genome structure, but results in massive interaction matrices. Existing tools have limitations in scale, comparison across conditions, and navigation. Gehlenborg demonstrated HiGlass, an interactive multi-resolution viewer for exploring these matrices. He also introduced HiPiler for investigating detected patterns, by filtering, aggregating, and correlating them with other data. HiGlass and HiPiler address challenges of visualizing 3D genome data at different scales and across many samples.
5. Transcription Factor Interactions
• A transcription factor binds to specific DNA sequences and controls transcription
• Two types of interactions
• Transcription factor regulates transcription of a gene
• Protein modifies or interacts with transcription factor
• Estimated 45,000+ such interactions
8. Related Work
• TRANSFAC
• Manually curated DB
• Eukaryotic TFs and genomic binding sites
• Commercial version contains reports for 21,000 transcription factors [1]
• Public version contains reports for 7,000 transcription factors [1]
• TcoF: Transcription co-Factor Database
• 1365 transcription factors [2]
• Manually curated from BioGrid, MINT and EBI
• Integrated Transcription Factor Platform
• Predicted interactions using sequence data [3]
• SVMs
[1]: https://portal.biobase-international.com/archive/documents/transfac_comparison.pdf
[2]: http://cbrc.kaust.edu.sa/tcof/
[3]: http://itfp.biosino.org/itfp/
11. Workflow:
• Scan Uniprot for GO:0003700 & Homo sapiens
• List publications cited for “Interaction with”
• Run GNormPlus
• Normalize Gene ID to Uniprot ID
• If Uniprot ID contains GO:0003700, label transcription factor
• Offset and Boundary Corrections
• Manual Annotation of Relations on tagtog
Filter Swissprot for transcription factor activity, sequence-specific DNA binding (GO Term GO:0003700) and Homo sapiens
12. For each protein, obtain the list of publications cited for “INTERACTION WITH”
13. Run GNormPlus, a gene tagger, on each of these abstracts, giving Gene or Gene Product (GGP) annotations
14. Entrez Gene IDs normalized to Uniprot IDs using priority selection (first Swissprot, then TrEMBL)
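The priority selection on this slide could be sketched as follows. This is a minimal illustration and not the authors' code: the mapping structure, the function name `normalize_gene_id`, and all identifiers in the toy data are made up for the example. It assumes that each Entrez Gene ID maps to a list of candidate Uniprot accessions, each flagged as reviewed (Swissprot) or unreviewed (TrEMBL).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical structure: Entrez Gene ID -> [(Uniprot accession, is_swissprot)].
Candidates = Dict[str, List[Tuple[str, bool]]]


def normalize_gene_id(gene_id: str, mapping: Candidates) -> Optional[str]:
    """Return one Uniprot accession for a gene ID, preferring Swissprot over TrEMBL."""
    candidates = mapping.get(gene_id, [])
    # sorted() is stable: Swissprot entries (key False) come first,
    # and ties keep their original order.
    ranked = sorted(candidates, key=lambda entry: not entry[1])
    return ranked[0][0] if ranked else None


# Toy data with placeholder identifiers (not real accessions).
toy_mapping: Candidates = {
    "1001": [("TR_0001", False), ("SP_0001", True)],
    "1002": [("TR_0002", False)],
}

print(normalize_gene_id("1001", toy_mapping))  # Swissprot entry wins
print(normalize_gene_id("1002", toy_mapping))  # falls back to TrEMBL
```

Returning `None` for an unmapped gene ID leaves it to the caller to decide whether to drop the annotation or keep it without a Uniprot ID.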
15. All GGPs are cross-referenced with GO Term GO:0003700 and its descendants. If the Uniprot ID contains an annotation for transcription factor activity, the GGP is labeled as a transcription factor
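One way to read the cross-referencing step is as a set intersection between a protein's GO annotations and the set formed by GO:0003700 plus its descendants. The sketch below assumes a simple parent-to-children dictionary; the child term names are placeholders rather than real GO identifiers, and the function names are hypothetical.

```python
from typing import Dict, Set


def descendants(term: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Collect a GO term together with all of its descendants."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_transcription_factor(protein_terms: Set[str], tf_terms: Set[str]) -> bool:
    """A protein is labeled a TF if any of its GO terms falls in the TF term set."""
    return bool(protein_terms & tf_terms)


# Toy hierarchy; everything except GO:0003700 is a placeholder name.
toy_children = {
    "GO:0003700": {"GO:child_a", "GO:child_b"},
    "GO:child_a": {"GO:grandchild"},
}
tf_terms = descendants("GO:0003700", toy_children)

print(is_transcription_factor({"GO:grandchild", "GO:other"}, tf_terms))
print(is_transcription_factor({"GO:other"}, tf_terms))
```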
16. Correct entity boundaries and offsets, and add abbreviations
17. Manual annotation of relations on tagtog
21. Method Development
• Pipeline (based on nalaf)
• A Dataset object is created and passed from one module to the next in the pipeline
• Data Reader: reads in textual data from some input (PMID, HTML) to create a Dataset object
• Annotation Recognition: reads in annotations from PubTator or ann.json and adds them to the Dataset
• Splitter: splits the text of each Document in the dataset into sentences
• Tokenizer: creates Tokens representing the smallest processing unit, usually words
• Parser: parses each sentence and stores syntactic and dependency parse trees
• Feature Generator: generates features for learning a model
• Learning: handles learning and prediction using SVMLight
• Evaluator: evaluates performance of prediction at the edge and document level
• Edge Generator: generates potential relations and reduces relation extraction to binary classification
• Writer: writes the predicted annotations in tagtog-compatible ann.json format
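The Edge Generator's reduction of relation extraction to binary classification can be illustrated with a small sketch. It is not the nalaf or relna implementation: the `Entity` type, the `generate_edges` function, the rule of keeping only pairs that involve a transcription factor, and the "GENE_X" entity are all assumptions made for the example. The +1 and -1 labels follow the target convention of SVMLight's classification mode.

```python
from itertools import combinations
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Entity(NamedTuple):
    text: str
    is_tf: bool  # True if the entity was labeled as a transcription factor


def generate_edges(
    entities: List[Entity], gold: Set[FrozenSet[str]]
) -> List[Tuple[str, str, int]]:
    """Pair up entities in one sentence and label each pair +1 or -1."""
    edges = []
    for a, b in combinations(entities, 2):
        if not (a.is_tf or b.is_tf):
            continue  # assumption: a candidate relation must involve a TF
        label = 1 if frozenset((a.text, b.text)) in gold else -1
        edges.append((a.text, b.text, label))
    return edges


# Toy sentence-level input; "GENE_X" is a made-up entity.
entities = [
    Entity("androgen receptor", True),
    Entity("SMRT", False),
    Entity("GENE_X", False),
]
gold = {frozenset(("androgen receptor", "SMRT"))}

edges = generate_edges(entities, gold)
print(edges)
```

Each candidate edge then becomes one training or prediction instance, so a standard binary classifier can decide whether the relation holds.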
28-33. Feature Generation
• Sentence Features
• Token Features
• N-gram Features
• Feature example: “androgen receptor”, “receptor interacts”, “interacts with”, “with SMRT”
• Linear Context and Distance between Entities
• Feature example: “linear_distance”: 2
• Dependency Features
• Shortest path between the two entities
• Feature example: receptor -> interacts -> with -> SMRT
• Path constituents
• Feature example: [“interacts”, “with”]
• Root word
• Feature example: “root_word”: “interacts”
• Path to root word
• Feature example: receptor -> interacts; SMRT -> with -> interacts
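The dependency feature examples can be reproduced with a short sketch over a hand-written head map for the phrase "androgen receptor interacts with SMRT". The head map is an assumption standing in for real parser output, and tokens are treated as unique strings for brevity (a real pipeline would index tokens by position). Because a dependency parse is a tree, the shortest path between two tokens runs through their lowest common ancestor.

```python
from typing import Dict, List, Optional

# Hand-written head map (token -> syntactic head); None marks the root.
HEADS: Dict[str, Optional[str]] = {
    "androgen": "receptor",
    "receptor": "interacts",
    "interacts": None,
    "with": "interacts",
    "SMRT": "with",
}


def path_to_root(token: str, heads: Dict[str, Optional[str]]) -> List[str]:
    """Follow head links from a token up to the root of the sentence."""
    path = [token]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


def shortest_path(a: str, b: str, heads: Dict[str, Optional[str]]) -> List[str]:
    """Shortest path in the tree: up from a to the common ancestor, then down to b."""
    up_a = path_to_root(a, heads)
    up_b = path_to_root(b, heads)
    on_b = set(up_b)
    ancestor = next(tok for tok in up_a if tok in on_b)  # lowest common ancestor
    left = up_a[: up_a.index(ancestor) + 1]
    right = up_b[: up_b.index(ancestor)]
    return left + right[::-1]


path = shortest_path("receptor", "SMRT", HEADS)
features = {
    "shortest_path": " -> ".join(path),
    "path_constituents": path[1:-1],  # tokens strictly between the entities
    "path_length": len(path) - 1,  # number of dependency edges
    "root_word": path_to_root("receptor", HEADS)[-1],
}
print(features)
```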
46. Conclusion
• Development of new corpus with semi-automatic annotation
• Published at PubAnnotation: http://pubannotation.org/projects/relna
• Development of new method for extracting relations of transcription factors
• Registered with Elixir Tools: https://bio.tools/tool/RostLab/relna/0.1.0
• Available on GitHub: https://github.com/Rostlab/relna
• Integration into nalaf and building a generalized relation extraction tool
47. Future Work
• Coreference resolution techniques
• Generalizing method for spanning multiple sentences
• Further testing with neural networks
53. Parsing
• Dependency Parsing
• Identify the relations between words
• O(n^3) algorithm, with n as the number of words in the sentence
• Constituency Parsing
• Identify phrases (noun chunks, verb chunks etc.) and their relative structure and hierarchy in the sentence
• O(n^5) algorithm, with n as the number of words in the sentence
54. Exhaustive List of Features
• Sentence Features
• BOW, Stem
• #entities, #BOW count
• Token Features
• Token Text, Masked Text
• Stem, POS
• Capitalization, Digits, Hyphens and other Punctuation
• Char bigrams and trigrams
• Dependency Features for Shortest Paths
• Path direction (e.g. FFRFR)
• Dependency types in path
• Path length
• Intermediate Tokens
• Path Constituents (e.g. “interact”, “bind” etc.)
• Root word of the sentence
• Linear Context
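A few of the token features listed above are simple enough to sketch directly. The feature names, the "ENTITY" mask symbol, and the function names below are hypothetical choices for illustration. Stem and POS are left out because they need an external stemmer and tagger.

```python
import string
from typing import Dict, List, Union


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of a token."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def token_features(token: str, is_entity: bool = False) -> Dict[str, Union[str, bool, List[str]]]:
    """Surface features of a single token."""
    return {
        "text": token,
        # Masking hides the entity name so the model learns from context instead.
        "masked_text": "ENTITY" if is_entity else token,
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "has_punctuation": any(ch in string.punctuation for ch in token),
        "char_bigrams": char_ngrams(token, 2),
        "char_trigrams": char_ngrams(token, 3),
    }


print(token_features("SMRT", is_entity=True))
```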
55. Other Features
• Linear distance between entities
• Presence of specific words in the sentence
• Prior tokens
• Intermediate tokens
• Post tokens
• N-gram features
• Bigram
• Trigram
• Relative Entity Order
• Conjoint Entity Text
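The linear context features can be sketched from token offsets alone. The sketch assumes non-overlapping entity spans given as (start, end) token indices with an exclusive end, and it defines linear distance as the number of tokens between the two entities. That definition is an assumption, not something the slides state. The function and feature names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def linear_context(
    tokens: List[str], span_a: Span, span_b: Span
) -> Dict[str, Union[List[str], int, bool]]:
    """Tokens before, between and after two non-overlapping entity spans."""
    first, second = sorted([span_a, span_b])
    return {
        "prior_tokens": tokens[: first[0]],
        "intermediate_tokens": tokens[first[1] : second[0]],
        "post_tokens": tokens[second[1] :],
        "linear_distance": second[0] - first[1],
        "a_before_b": span_a[0] < span_b[0],  # relative entity order
    }


tokens = ["androgen", "receptor", "interacts", "with", "SMRT"]
context = linear_context(tokens, span_a=(0, 2), span_b=(4, 5))
print(context)
```

With this definition, "androgen receptor interacts with SMRT" has a linear distance of 2 between the two entities.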
56. N-gram Features
• Example sentence: The cow jumped over the moon
• Unigrams: “the”, “cow”, “jumped”, “over”, “moon”
• Bigrams: “the cow”, “cow jumped”, “jumped over”, “over the”, “the moon”
• Trigrams: “the cow jumped”, “cow jumped over”, “jumped over the”, “over the moon”
• …
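The n-gram lists on this slide can be generated with a sliding window over the tokens. The sketch below assumes lowercasing and whitespace tokenization, which is simpler than the tokenizer a real pipeline would use. The function name `ngrams` is a hypothetical choice.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-token windows, joined by spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The cow jumped over the moon".lower().split()

unigrams = sorted(set(ngrams(tokens, 1)))  # distinct words, as listed on the slide
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the six-token example gives five bigrams and four trigrams.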