Data integration - Integration of functional associations using STRING

Lars Juhl Jensen

EMBO World Practical Course on Computational Biology, Shanghai Jiao Tong University, Shanghai, China, August 22, 2009.

Jensen, Kuhn et al., Nucleic Acids Research, 2009

von Mering et al., Nucleic Acids Research, 2005

Korbel et al., Nature Biotechnology, 2004

Beyer et al., Nature Reviews Genetics, 2007

BIND: Biomolecular Interaction Network Database

BioGRID: Biological General Repository for Interaction Datasets

MIPS: Munich Information Center for Protein Sequences

Letunic & Bork, Trends in Biochemical Sciences, 2008

KEGG: Kyoto Encyclopedia of Genes and Genomes

PID: NCI-Nature Pathway Interaction Database

OMIM: Online Mendelian Inheritance in Man

Frishman et al., Modern Genome Annotation, 2009

Kuhn et al., Nucleic Acids Research, 2008

This presentation discusses using networks to derive biological function from genomic data. It covers several types of evidence, such as gene expression, protein-protein interactions, genetic interactions, pathways, and literature mining, including co-mentioning in text. It also notes the challenges of integrating these diverse data sources, which differ in format, identifiers, and quality and are spread across many databases and genomes. Finally, it recommends combining all available evidence to predict functional associations.
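The recommendation to combine all available evidence can be illustrated with a minimal sketch. It assumes that each evidence type has already been converted to a probability-like confidence score between 0 and 1 and that the evidence types are independent, so that the combined score is one minus the product of the individual "miss" probabilities. The function name, the channel names, and the score values are hypothetical and are not taken from the slides; the scoring scheme actually used by STRING may differ in detail.

```python
from math import prod


def combine_scores(scores):
    """Combine per-channel confidence scores into one score.

    Assumption: the channels are independent, so the chance that all of
    them are wrong is the product of (1 - score) over the channels.
    """
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical per-channel scores for a single protein pair.
evidence = {"coexpression": 0.4, "experiments": 0.6, "textmining": 0.5}
combined = combine_scores(evidence.values())
print(f"combined score: {combined:.2f}")
```

Under the independence assumption, several moderately confident lines of evidence yield a combined score higher than any single one, which is the motivation for integrating evidence rather than relying on one data source.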

Similar to Data integration - Integration of functional associations using STRING

The STRING database

Lars Juhl Jensen

The STRING database integrates known and predicted protein-protein interactions, including direct (physical) and indirect (functional) associations derived from genomic context, high-throughput experiments, co-expression and literature mining. It covers over 373 proteomes and draws on data from curated databases, textmining and computational prediction methods to provide a global network of protein interactions. STRING uses a scoring scheme to assign probabilities to interactions based on different lines of evidence and benchmarking against a gold standard reference set.

Cross-species data integration

Lars Juhl Jensen

Integration of heterogeneous data

Lars Juhl Jensen

The document discusses the integration of heterogeneous biological data and the development of computational tools and databases to analyze protein-protein interaction networks, phosphorylation signaling networks, and other molecular pathways. It describes several databases and web tools created by the author and other researchers, including NetworKIN, STRING, STITCH, NetPhorest, and Reflect, that combine data from diverse sources to build networks and gain new biological insights. It also addresses ongoing challenges in data integration like variable data quality, different data formats and identifiers, and the need for continued benchmarking and validation of computational predictions.

Protein interaction networks

Lars Juhl Jensen

The document discusses protein interaction networks and the STRING database. It describes how STRING uses genomic context, gene fusion, co-expression, and curated data to predict protein-protein interactions. It also explains how STRING integrates this interaction data with chemical compound data to build networks connecting proteins and chemicals. The document provides examples of how STRING can be used to analyze the cell cycle and temporal protein interaction networks, and links to websites for exploring the STRING and STITCH databases.

STRING - Modeling of biological systems through cross-species data integ...

Lars Juhl Jensen

The document discusses the STRING database, which integrates data from diverse sources to predict protein-protein interactions and functional associations. It summarizes different lines of evidence used by STRING, including genomic context, co-expression, co-mentioning in articles, and transfer of functional annotations between orthologs. The document also briefly outlines how STRING scores and benchmarks different predictive methods and defines functional modules to model biological systems.

Integration of diverse large-scale datasets

Lars Juhl Jensen

The document discusses the integration of diverse large-scale datasets to build comprehensive protein-protein interaction networks. It describes challenges with data from different sources having different identifiers, evidence types and quality. It also discusses methods used by STRING and other databases to combine data from curated databases, literature mining, primary datasets and transfer of interactions based on orthology. Examples are given of cell cycle studies in yeast that have analyzed periodically expressed genes and protein interactions.

Large-scale integration of data and text

Lars Juhl Jensen

This document discusses large-scale integration of biological data and text mining. It describes three main parts: association networks that connect entities based on "guilt by association", protein interaction networks built using data from STRING and 2000+ genomes, and using genomic context like gene fusion, gene neighborhood, and phylogenetic profiles. It then provides examples of using STRING to query protein networks and discusses challenges of text mining like the exponential growth of literature and limitations of current natural language processing. Finally, it describes the Jensen Lab's approach of integrating curated knowledge, experimental data, predictions, and data from databases like STRING, STITCH, PubChem, COMPARTMENTS, Gene Ontology, UniProtKB, and disease databases into a common framework with

STRING: Protein networks from data and text mining

Lars Juhl Jensen

This document discusses building protein networks through data and text mining. It describes integrating data from many databases on protein interactions and functional associations, which are in various formats and identifiers. Named entity recognition and co-mentioning are used to extract protein names and their relationships from text. The integrated data is then visualized in networks and databases like STRING provide this network data along with search and analysis tools through a web resource, files, and APIs.

STRING - Protein networks from data and text mining

Lars Juhl Jensen

This document discusses protein networks and how they can be constructed from data and text mining. It describes challenges like different data sources using different formats and identifiers and issues with data quality. It also outlines techniques used to parse the data, map identifiers, assign quality scores, and implicitly weight evidence by quality to build a comprehensive protein interaction network across all available sources. The resulting database is made freely available online as a web resource, downloadable files, and via an API and apps to facilitate its use.

Advanced bioinformaticsof proteomics datasets

Lars Juhl Jensen

This document discusses advanced bioinformatics approaches for analyzing proteomics datasets, including using signaling networks, association networks, and text mining. It describes using machine learning to predict protein interactions and developing scoring schemes to integrate data from multiple sources. The document also covers using text mining approaches like named entity recognition and information extraction to analyze the large amount of proteomics information available in scientific literature.

Gene association networks - Large-scale integration of data and text

Lars Juhl Jensen

This document discusses how gene association networks are created by integrating large amounts of genomic data and text from many databases. Researchers develop parsers and mapping files to combine information about genes from various sources, which may have different formats and identifiers. They also use text mining to extract gene and protein associations from literature. The resulting association networks provide a comprehensive view of functional relationships between genes and are made available through online resources like STRING-DB.

Data and Text Mining

Lars Juhl Jensen

The document discusses Lars Juhl Jensen's work in data and text mining of biomedical literature and records to analyze protein networks, gene interactions, and predict relationships between genes and proteins. Jensen uses text mining techniques like named entity recognition and information extraction from millions of abstracts and articles to build resources on protein interactions, gene neighborhoods, and disease localization that are compiled on websites for public use and dissemination of the knowledge.

Network biology: Large-scale data and text mining

Lars Juhl Jensen

This document discusses network biology and large-scale text mining. It describes using computational predictions, experimental data, and text mining to build protein interaction networks for various species from databases with different formats and quality. It also discusses using named entity recognition, expansion rules, and flexible matching to extract information from millions of abstracts and articles to identify relationships between biological entities like proteins, complexes, pathways, tissues, compartments, and diseases. The extracted information is integrated into web interfaces and services to allow visualization and exploration of the biological networks and relationships.

STRING - Large-scale integration of data and text

Lars Juhl Jensen

This document discusses large-scale integration of biological data and text. It mentions combining data from many databases on proteins, interactions, complexes and pathways using parsers and mapping files to overcome different formats and identifiers. It discusses using techniques like co-mentioning within documents, paragraphs and sentences to provide comprehensive information and improve quality scores. The goal is to combine all available evidence from various sources to generate a comprehensive resource, as described on the string-db.org website and Cytoscape app.

Data integration: The STITCH database of protein–small molecule interactions

Lars Juhl Jensen

The document discusses data integration and summarizes the STITCH database, which integrates protein-small molecule interactions from several sources. It also summarizes a method for predicting novel drug targets using side effect information by analyzing similarities in side effect profiles between existing drugs. Predictions were tested in vitro and in cell assays, with promising results. The document acknowledges contributions from researchers involved in developing STITCH and the side effect prediction method.

STRING: Large-scale data and text mining

Lars Juhl Jensen

This document discusses large-scale data and text mining techniques used by STRING to build comprehensive protein association networks. STRING integrates information from genomic context, high-throughput experiments, co-expression and curated databases to assign a confidence score to each association. Natural language processing is applied to mine the scientific literature and extract entity and relation information from millions of articles and abstracts to expand the known protein association networks beyond curated knowledge. STRING is freely accessible online and allows users to perform queries and analyze networks for various organisms.

Gene association networks - Large-scale integration of data and text

Lars Juhl Jensen

This document discusses gene association networks and large-scale integration of biological data and text. It describes using computational predictions from 2000+ genomes, gene fusions, phylogenetic profiles, experimental data like gene coexpression, and curated knowledge from pathways and databases to build networks. It highlights challenges like different data formats, identifiers, and quality, and how parsers and mapping files were used to integrate data onto a common scale. It also discusses using text mining techniques like named entity recognition and co-mentioning to extract protein and gene information from text and assign quality scores to associate genes with diseases, tissues, localization, and model organisms.

Large-scale integration of data and text

Lars Juhl Jensen

This document discusses large-scale integration of biological data from a variety of sources including experimental data, curated knowledge databases, and text mining of the scientific literature. It describes several databases that have been developed for mining protein interactions, chemical relationships, genomic and medical data. Natural language processing techniques are used to extract structured information from unstructured text and link entities and relationships across these different data sources to build molecular networks.

Systems biology: Bioinformatics on complete biological system

Lars Juhl Jensen

Systems biology uses mathematical modeling to study molecular networks and complete biological systems. It requires detailed knowledge of molecular interactions, which can be determined through various high-throughput interaction assays. However, interaction data from different databases may have varying quality and identifiers, so integrating this data requires resolving these issues. Natural language processing of literature can provide additional interaction data by recognizing named entities and extracting relations from text.

Network biology

Lars Juhl Jensen

This document covers network biology in three parts: 1) protein networks, localization and diseases, and disease networks; 2) approaches to integrating data from computational predictions, experimental data, and curated knowledge; and 3) a suite of web resources for exploring protein localization and disease associations based on these integrated data. It closes with acknowledgments of collaborators and databases.

Similar to Data integration - Integration of functional associations using STRING (20)

The STRING database

Cross-species data integration

Integration of heterogeneous data

Protein interaction networks

STRING - Modeling of biological systems through cross-species data integ...

Integration of diverse large-scale datasets

Large-scale integration of data and text

STRING: Protein networks from data and text mining

STRING - Protein networks from data and text mining

Advanced bioinformatics of proteomics datasets

Gene association networks - Large-scale integration of data and text

Data and Text Mining

Network biology: Large-scale data and text mining

STRING - Large-scale integration of data and text

Data integration: The STITCH database of protein–small molecule interactions

STRING: Large-scale data and text mining

Gene association networks - Large-scale integration of data and text

Large-scale integration of data and text

Systems biology: Bioinformatics on complete biological system

Network biology

More from Lars Juhl Jensen

One tagger, many uses: Illustrating the power of dictionary-based named entit...

Lars Juhl Jensen

This document summarizes a Twitter thread discussing the uses of a dictionary-based named entity recognition tool called Tagger. Tagger can recognize genes, proteins, diseases, and other biomedical entities. It is open source, processes over 1000 abstracts per second, and achieves 70-80% recall and 80-90% precision. Tagger has been applied to tasks like identifying drug-disease associations, adverse drug events, and protein-protein interactions. It is available as a Docker container or web service.

One tagger, many uses: Simple text-mining strategies for biomedicine

Lars Juhl Jensen

The document summarizes a text mining tool called a tagger that can be used for named entity recognition in biomedical texts. It recognizes genes, proteins, chemicals, diseases, and other entities. The tagger is open source, runs quickly at over 1000 abstracts per second, and has 70-80% recall and 80-90% precision. It comes with Python and Docker implementations and can be accessed via a web service. It is useful for tasks like extracting functional associations from literature and electronic health records.

Extract 2.0: Text-mining-assisted interactive annotation

Lars Juhl Jensen

This document describes Extract 2.0, a text-mining tool that can assist with interactive annotation of documents. It uses dictionary-based tagging to identify relevant entities like genes and diseases. It achieves 70-80% recall and 80-90% precision on entity extraction and was evaluated in BioCreative challenges where it received positive feedback from curators. The tool is open source and available as a web service or Python wrapper.

Network visualization: A crash course on using Cytoscape

Lars Juhl Jensen

STRING & STITCH: Network integration of heterogeneous data

Lars Juhl Jensen

The document discusses STRING and STITCH, two online databases that integrate data on protein-protein interactions, pathways, and functional associations from various sources. STRING collects data on over 9.6 million proteins and 430 thousand chemicals from sources like text mining, experimental assays, and co-expression analyses. It aims to provide a comprehensive global view of known and predicted protein associations. STITCH also integrates interaction data but focuses more on chemical-protein interactions. Both databases provide user-friendly web interfaces for browsing and visualizing interaction networks.

Biomedical text mining: Automatic processing of unstructured text

Lars Juhl Jensen

1) Lars Juhl Jensen discusses biomedical text mining and automatic processing of unstructured text such as patent literature, grant proposals, FDA product labels, and electronic medical records. 2) Named entity recognition is used to identify genes/proteins, chemical compounds, diseases, and other entities in text through comprehensive dictionaries and flexible matching rules that account for variations. 3) Relation extraction uses natural language processing techniques like part-of-speech tagging and sentence parsing along with manually crafted rules and machine learning to identify implicit relations between entities in text such as transcription factor targets, kinase substrates, and protein-protein interactions.

Medical network analysis: Linking diseases and genes through data and text mi...

Lars Juhl Jensen

The document summarizes the work of Lars Juhl Jensen and others on medical network analysis and linking diseases and genes through data and text mining of electronic health records. It discusses how they have used Danish national health registries containing data on over 6 million patients and 119 million diagnoses over 14 years to study disease trajectories and comorbidities. It also describes how they have developed methods to integrate data from various sources to generate networks linking diseases and genes.

Network Biology: A crash course on STRING and Cytoscape

Lars Juhl Jensen

This document provides an overview of STRING, a protein-protein association database, and Cytoscape, a network visualization tool. It describes how STRING contains functional associations between proteins derived from genomic context, co-expression and curated databases. Cytoscape can import STRING networks and external data to map onto nodes. It offers visualization of networks through layouts and attributes, and analysis through clustering, selection filters and enrichment. The document recommends using these tools together to explore protein association networks.

Cellular networks

Lars Juhl Jensen

This document discusses different approaches to visualizing cellular networks and the molecular interactions between proteins. It notes that there are many different types of data that could be shown, such as protein names, functions, localization, expression, modifications, and interaction types. However, it is impossible to show all this information at once. The document recommends using different visualizations like force-directed layouts to distribute proteins in 2D or lining up interactions in 1D. It acknowledges open challenges like showing time-course data and modification sites. In the end, the document thanks several researchers who have contributed to mapping and visualizing cellular networks.

Cellular Network Biology: Large-scale integration of data and text

Lars Juhl Jensen

The document discusses various community resources and software tools for integrating large-scale data and text, including STRING for protein networks, STITCH for chemical networks, COMPARTMENTS for subcellular localization, TISSUES for tissue expression, and DISEASES for disease associations. It provides an overview of text mining techniques used to extract information from literature to build networks in these resources. The presenter demonstrates the Cytoscape App which can import and analyze networks from STRING, perform queries, and analyze subcellular localization, tissue expression, and disease enrichment.

Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...

Lars Juhl Jensen

This document discusses statistical methods for analyzing high-throughput biomedical screens and common pitfalls. It introduces several statistical tests such as t-tests, ANOVA, Fisher's exact test, and the Mann-Whitney U test. It also discusses challenges like multiple testing, resampling techniques, and biases that can occur like studiedness bias and abundance bias in big data analyses. Controlling false discovery rates and considering effect sizes are recommended over solely relying on p-values to determine biological significance.
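
To make the point about false discovery rates concrete, here is a minimal sketch of one common way to control the FDR, the Benjamini-Hochberg procedure. The summary above does not say which FDR method the slides use, so the choice of Benjamini-Hochberg is an assumption, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    not taken from the slides.
    """
    m = len(pvalues)
    # Indices of the p-values in ascending order of p-value
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four tests
print(benjamini_hochberg([0.039, 0.001, 0.216, 0.008]))
```

Note that the procedure is step-up: once the largest qualifying rank is found, all smaller p-values are rejected as well, even if one of them missed its own threshold.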

STRING & related databases: Large-scale integration of heterogeneous data

Lars Juhl Jensen

The document discusses the STRING database, which integrates heterogeneous biological data to generate association networks for proteins. It describes how STRING collects and connects curated knowledge, experimental data, and predicted interactions from genomic context, co-expression and text mining. The document also outlines exercises for users to explore protein-protein associations in STRING and related databases that integrate data on subcellular localization, tissue expression, and disease associations.

Tagger: Rapid dictionary-based named entity recognition

Lars Juhl Jensen

Tagger is a named entity recognition tool that can process over 1000 abstracts per second using a dictionary-based approach. It achieves 70-80% recall and 80-90% precision using comprehensive dictionaries, expansion rules, and a curated blacklist to identify entity types like genes, proteins, chemicals, and diseases. The tool has a C++ engine, is inherently thread-safe, and includes interactive annotation, Python wrappers, and a REST API.
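
The dictionary-plus-blacklist idea described above can be sketched in a few lines of Python. This is not Tagger's implementation (which, per the summary, is a C++ engine with expansion rules); it is a simplified greedy longest-match tagger, and the function name, dictionary entries, identifiers, and blacklist below are all hypothetical.

```python
import re


def tag_entities(text, dictionary, blacklist=frozenset(), max_words=3):
    """Greedy longest-match dictionary tagging (illustrative sketch only).

    dictionary maps lowercase names to identifiers; blacklist holds
    lowercase names that must never be tagged.
    Returns a list of (start, end, matched_text, identifier) tuples.
    """
    # Character spans of tokens: runs of letters, digits, or hyphens
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9-]+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest candidate span first, then shorter ones
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[start:end]
            key = candidate.lower()  # case-insensitive lookup
            if key in dictionary and key not in blacklist:
                matches.append((start, end, candidate, dictionary[key]))
                i += n  # skip past the matched tokens
                found = True
                break
        if not found:
            i += 1
    return matches


# Hypothetical dictionary and blacklist
names = {"insulin receptor": "INSR", "insulin": "INS", "irs1": "IRS1", "sds": "SDS"}
stop = {"sds"}  # ambiguous name that is suppressed by the blacklist
for match in tag_entities("The insulin receptor binds IRS1, and SDS was used.", names, stop):
    print(match)
```

Trying the longest span first means "insulin receptor" is tagged as one entity rather than as "insulin" alone, and the blacklist suppresses names that are in the dictionary but too ambiguous to trust.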

Medical text mining: Linking diseases, drugs, and adverse reactions

Lars Juhl Jensen

This document discusses medical text mining and linking diseases, drugs, and adverse reactions. It describes using text mining on clinical narratives in Danish to recognize named entities like drugs and diseases, identify relationships between them like adverse drug reactions, and discover new ADRs. The goal is to generate structured data on topics like comorbidities, diagnosis trajectories, and reimbursement to supplement limited structured data and help busy doctors by analyzing large amounts of unstructured text.

Network biology: Large-scale integration of data and text

Lars Juhl Jensen

The document discusses network biology and large-scale data integration. It describes protein-protein interaction networks like STRING that integrate data from curated knowledge, experiments, and predictions. It provides exercises to explore the human insulin receptor (INSR) in STRING, examining the types of evidence that support its interaction with IRS1. It also introduces other integrated networks like STITCH for chemicals and COMPARTMENTS for subcellular localization. Natural language processing techniques like named entity recognition, information extraction, and semantic tagging are used to integrate text data from the literature into these interaction networks.

Medical data and text mining: Linking diseases, drugs, and adverse reactions

Lars Juhl Jensen

This document discusses medical data and text mining to link diseases, drugs, and adverse reactions. It describes using structured data from Danish central registries and unstructured data from hospital electronic health records. Named entity recognition is used to extract diseases, drugs, and adverse reactions from free text clinical notes written in Danish. Hand-crafted rules are developed to identify relationships between extracted entities like adverse drug reactions. This allows estimating frequencies of known adverse drug reactions and discovering new adverse drug reactions by analyzing diagnosis trajectories and medication information.

Cellular Network Biology

Lars Juhl Jensen

This document discusses cellular network biology and summarizes several key papers on topics like proteome analysis using mass spectrometry, integrating protein network and experimental data, challenges with different biological databases having varying formats and quality, and using natural language processing techniques like named entity recognition and relation extraction to analyze medical text for information like diagnosis trajectories and adverse drug reactions.

Network biology: Large-scale integration of data and text

Lars Juhl Jensen

This document discusses natural language processing (NLP) techniques for extracting information from biomedical literature and integrating it with network and interaction data. It describes how NLP is used to identify entities like genes and proteins, extract relationships between entities, and integrate this text-mined information with existing interaction networks from databases like STRING to expand knowledge of protein interactions, complexes, pathways and associations with diseases. The document provides examples of using NLP analysis on sentences and the STRING and Tissues databases to explore tissue specificity and disease relationships for insulin and the insulin receptor.

Biomarker bioinformatics: Network-based candidate prioritization

Lars Juhl Jensen

The document discusses three parts of biomarker bioinformatics: data integration from multiple databases, text mining of scientific literature, and using that integrated data to prioritize biomarker candidates. It describes combining data on 9.6 million proteins from curated databases, using text mining to extract named entities from over 10,000 papers, and then using network and heat diffusion approaches to rank candidates based on evidence in the integrated data. The goal is to help identify new biomarker candidates from large amounts of biological data.

The Art of Counting: Scoring and ranking co-occurrences in literature

Lars Juhl Jensen

The document discusses methods for scoring and ranking co-occurrences of entities like diseases and genes in literature. It describes counting co-occurrences within different text levels like documents, paragraphs and sentences, and using techniques like z-score transformations and weighted combinations that can rank entities for a given query without changing the overall ranking. The methods have been implemented in web tools that can return results for queries within seconds using preprocessed named entity recognition results stored in a relational database.
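
A minimal sketch of counting co-occurrences at the document, paragraph, and sentence levels and combining them with weights is given below. The summary does not give the actual weights or the exact scoring formula, so the weights, function name, and input layout here are assumptions, and the z-score transformation step is left out.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_scores(documents, w_doc=1.0, w_par=0.5, w_sent=0.25):
    """Weighted co-occurrence counts for entity pairs (illustrative sketch).

    documents: list of documents; each document is a list of paragraphs;
    each paragraph is a list of sentences; each sentence is a list of
    entity identifiers already found by named entity recognition.
    The weights are hypothetical, not those used in the slides.
    """
    scores = defaultdict(float)
    for doc in documents:
        sent_pairs, par_pairs = set(), set()
        doc_entities = set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                entities = set(sentence)
                # Pairs mentioned together within one sentence
                sent_pairs.update(combinations(sorted(entities), 2))
                par_entities |= entities
            # Pairs mentioned together within one paragraph
            par_pairs.update(combinations(sorted(par_entities), 2))
            doc_entities |= par_entities
        # Each document contributes at most once per level for a given pair
        for pair in combinations(sorted(doc_entities), 2):
            scores[pair] += w_doc
            if pair in par_pairs:
                scores[pair] += w_par
            if pair in sent_pairs:
                scores[pair] += w_sent
    return dict(scores)


# Two hypothetical tagged documents
docs = [
    [[["INS", "INSR"], ["IRS1"]], [["INSR"]]],
    [[["INS"]], [["IRS1"]]],
]
for pair, score in sorted(cooccurrence_scores(docs).items()):
    print(pair, score)
```

Pairs that share a sentence are rewarded at all three levels, so closer co-mentions end up with higher scores than pairs that merely appear in the same document.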