Exploring proteins, chemicals and their interactions with STRING and STITCH – biocs
This document summarizes databases and computational methods for exploring protein and chemical interactions. It introduces STRING and STITCH, which integrate various sources of data to predict interactions. STRING contains information on protein-protein interactions for 373 genomes. STITCH contains data on interactions between proteins and chemicals, including drugs. The document provides an example of using NetworKIN to predict kinase-substrate relationships and discusses how these databases and methods can provide insights into interaction networks and biological functions.
Biological literature mining - from information retrieval to biological disco... – Lars Juhl Jensen
14th International Conference on Intelligent Systems for Molecular Biology, Tutorial, Fortaleza Conference Center, Fortaleza, Brazil, August 6-10, 2006
Biomedical literature mining (and why we really need open access) – Lars Juhl Jensen
The 28th IATUL annual conference: Global Access to Science - Scientific Publishing for the Future, Royal Institute of Technology (KTH), Stockholm, Sweden, June 11-14, 2007
This document discusses various methodologies for extracting information from biological literature, including information retrieval, entity recognition, information extraction, and text/data mining. It provides an overview of different approaches like using co-occurrence, natural language processing, and machine learning methods. It also discusses challenges like integrating text with other data types and dealing with issues like ambiguity. Examples of existing text mining tools and their potential applications are also described.
The document discusses various techniques for literature mining and systems biology including information retrieval, entity recognition, information extraction, text mining, and integration of text and biological data. It provides examples and status of different methods, from established techniques for information retrieval, entity recognition, and simple information extraction to improving advanced natural language processing-based information extraction and methods for text mining and integration of text and data.
Literature mining: what is it, and should I care? – Lars Juhl Jensen
The document discusses literature mining and natural language processing techniques for extracting information from scientific papers. It describes steps in an NLP pipeline including information retrieval to find relevant papers, entity recognition to identify substances, and information extraction to formalize facts. It also briefly acknowledges databases and tools used, and references a movie.
The document outlines a text mining exercise to identify human proteins mentioned in biomedical abstracts and link them to diseases. It discusses using named entity recognition on two sets of abstracts about prostate cancer and schizophrenia to extract proteins and link them to the diseases. The document provides information on the dictionary and tagdir program used to perform the task and highlights example proteins linked to both prostate cancer and schizophrenia as results.
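The dictionary-based tagging step described above can be made concrete with a small sketch. This is illustrative only, not the actual tagdir program from the exercise; the synonym dictionary and identifiers below are invented for the example.

```python
import re

def tag_abstract(abstract, synonym_to_id):
    """Return (synonym, identifier) pairs whose synonym occurs in the abstract."""
    hits = []
    for synonym, identifier in synonym_to_id.items():
        # \b enforces word boundaries, so short gene symbols do not
        # match inside ordinary English words
        if re.search(r"\b" + re.escape(synonym) + r"\b", abstract, re.IGNORECASE):
            hits.append((synonym, identifier))
    return hits

# hypothetical synonyms and identifiers, for illustration only
proteins = {"AR": "protein_1", "DRD2": "protein_2"}
abstract = "DRD2 polymorphisms have been associated with schizophrenia."
print(tag_abstract(abstract, proteins))  # [('DRD2', 'protein_2')]
```

Real taggers additionally handle tokenization, longest-match resolution, and orthographic variation, but the core idea is the same dictionary lookup over each abstract.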
Text mining can summarize scientific documents in 3 sentences or less by identifying key entities and relationships. It recognizes concepts like genes, proteins, diseases and extracts facts from text. This extracted information can then be integrated with other data to create more useful resources and provide novel insights through augmented browsing and analysis. Text mining aims to make navigating vast amounts of scientific literature simpler and less boring.
Ontologies for life sciences: examples from the Gene Ontology – Melanie Courtot
The document discusses ontologies for life sciences, using the Gene Ontology (GO) as an example. It provides an overview of GO, describing it as a way to capture biological knowledge for gene products in a written and computable form using a set of concepts and relationships arranged hierarchically. GO allows consistent descriptions of genes/gene products across databases. Model organism databases provide annotations connecting genes to GO terms. The GO is a collaborative effort to address the need for consistent descriptions of genes.
Cross-species gene normalization by species inference – Raunak Shrestha
GenNorm is a method for gene normalization that handles gene mention variations, orthologous gene ambiguity, and intra-species gene ambiguity. It uses three modules: a gene name recognition module, a species assignation module, and a species-specific gene normalization module. The gene name recognition module identifies gene mentions and associates database identifiers. The species assignation module assigns species using lexicons. The species-specific gene normalization module measures inference scores of candidate identifiers in articles. GenNorm achieved good performance on two test datasets of full-text articles.
Variant (SNPs/Indels) calling in DNA sequences, Part 1 – Denis C. Bauer
This document discusses various topics related to mapping short sequencing reads to a reference genome, including:
- File formats like FASTQ that store sequencing reads and BAM/SAM formats for aligned reads.
- Alignment algorithms, including hash table-based mappers (e.g. MAQ) and trie/index-based mappers (e.g. BWA, Bowtie).
- Visualizing alignments using the Integrative Genomics Viewer (IGV).
- Performing quality control on BAM files by checking the percentage of mapped reads and coverage uniformity.
- The next session will focus on identifying genomic variants from mapped reads through SNP/indel calling and filtering.
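To make the FASTQ format mentioned above concrete, here is a minimal reader. Real pipelines use dedicated tools (e.g. FastQC, samtools); this sketch only decodes the four-line record structure and Phred+33 quality strings.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, qualities) from FASTQ lines.

    FASTQ records are four lines: @header, sequence, '+' separator,
    and a quality string where each character encodes Phred score + 33.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                             # '+' separator line
        qual = next(it).strip()
        phred = [ord(c) - 33 for c in qual]  # Phred+33 decoding
        yield header.strip()[1:], seq, phred

record = ["@read1", "ACGT", "+", "IIII"]
rid, seq, phred = next(read_fastq(record))
print(rid, seq, sum(phred) / len(phred))  # 'I' encodes Phred quality 40
```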
The document describes a real-time biomedical entity tagger developed in C++ that can tag entities in abstracts in under 0.001 seconds. It uses a custom hash table and is inherently thread-safe and scalable. A Python module and HTTP server were also created to allow the tagger to be used as a web service using a thread pool and priority queue. The tagger can identify various biomedical entities from a dictionary and has been applied to tools for augmented browsing and interactive annotation. Plans exist to improve the REST interface and support additional annotation standards.
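The serving pattern described here (a thread pool pulling work off a priority queue, so small interactive requests jump ahead of large batch jobs) can be sketched in a few lines. This is a toy illustration of the pattern, not the actual C++ tagger or its Python module.

```python
import queue
import threading

tasks = queue.PriorityQueue()

def worker(results):
    # each pool thread would run this loop; a real service would call the
    # shared, read-only tagging dictionary instead of counting words
    while True:
        priority, text = tasks.get()
        if text is None:          # sentinel to shut the worker down
            break
        results.append((priority, len(text.split())))
        tasks.task_done()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
tasks.put((1, "p53 binds MDM2"))        # high priority (lower number first)
tasks.put((5, "large batch abstract"))  # low priority
tasks.join()                            # wait for queued work to finish
tasks.put((0, None))
t.join()
print(results)
```

With several worker threads and a shared immutable dictionary, this structure is thread-safe without locks around the dictionary itself, which is the property the summary attributes to the tagger.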
Gene Wiki and Mark2Cure update for BD2K – Benjamin Good
The document discusses using crowdsourcing via platforms like Amazon Mechanical Turk and Mark2Cure to extract information from biomedical literature at scale. It summarizes experiments showing non-experts can accurately recognize disease concepts in PubMed abstracts when aggregated. The author proposes expanding this approach to identify genes, drugs, diseases and relationships to build a computable network of biomedical knowledge from the literature. Funding sources and collaborators supporting various related projects are acknowledged at the end.
Text mining techniques can be used to extract information and insights from the exponential growth of scientific literature. Key techniques include information retrieval to find relevant papers, named entity recognition to identify concepts, and information extraction to formalize facts. These techniques can be evaluated using benchmarking against manually annotated corpora, though creating such resources requires significant effort and the pragmatic approach of inspecting text mining outputs is much less work.
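Benchmarking against a manually annotated corpus, as mentioned above, usually reduces to computing precision and recall over predicted mentions. A minimal sketch, treating predictions and the gold standard as sets of (document, entity) pairs (the identifiers below are made up):

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against a gold standard."""
    tp = len(predicted & gold)                        # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("pmid1", "TP53"), ("pmid1", "MDM2"), ("pmid2", "BRCA1")}
pred = {("pmid1", "TP53"), ("pmid2", "BRCA1"), ("pmid2", "EGFR")}
p, r = precision_recall(pred, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```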
This document discusses various techniques in applied text mining, including named entity recognition, information extraction, and text/data integration. It covers extracting facts from text using natural language processing approaches like part-of-speech tagging and semantic tagging. It also discusses more pragmatic approaches using techniques like co-mentioning and guilt by association. The goal is to formalize biological facts and integrate text-derived information with databases of experimental data and computational predictions to build more comprehensive resources. Challenges include dealing with different data formats, identifiers, and quality across the many available databases.
eXframe: A Semantic Web Platform for Genomic Experiments – Tim Clark
eXframe is a reusable framework for creating online repositories of genomics experiments. It uses Drupal to structure annotations of experiments, biomaterials, and assays. eXframe automatically publishes this data as RDF and provides a SPARQL endpoint. The first instance is the Stem Cell Commons, which deeply annotates experiments, organisms, tissues, and more using ontologies. It allows flexible querying of the data via SPARQL and integration with other endpoints. eXframe creates both public and private RDF stores to selectively share experimental data with researchers.
eXframe: A Semantic Web Platform for Genomics Experiments – Tim Clark
Slides from a talk given at Bio-Ontologies 2013, Berlin, Germany, 20 July 2013.
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
This document summarizes a lab assignment on bioinformatics. Students were asked to practice using bioinformatics tools like BLAST and databases from NCBI to analyze DNA sequences. For the assignment, students were given a scenario where they had to use these tools to identify which staff members at a research project were illegally using DNA from an endangered primate species. Students performed BLAST searches comparing DNA samples from staff members to a sequence from the endangered primate provided by another company, identifying that two staff members were implicated.
The Gene Ontology & Gene Ontology Annotation resources – Melanie Courtot
The Gene Ontology (GO) provides structured controlled vocabularies for describing gene and gene product attributes across species. It includes three ontologies for molecular function, biological process, and cellular component. The GO is manually developed and electronically annotated to gene products to capture biological knowledge in a computable form. The GO Consortium aims to develop and maintain the GO through manual and computational methods, and to provide public GO annotation data and tools.
This document summarizes bioinformatics tools that can be used for analysis of high-throughput sequencing data for molecular diagnostics. It discusses databases for virulence factors and antimicrobial resistance as well as tools for assembly, annotation, pan-genome analysis, visualization, and commercial solutions. The presentation emphasizes that there is no single best tool and different approaches are needed for different questions. Collaboration with other researchers is recommended.
The document discusses various types of biological databases including sequence databases, structure databases, genome databases, and model organism databases. It provides examples of nucleotide databases like Genbank, DDBJ, EMBL-EBI, and TIGR. Genome browsers like UCSC Genome Browser, Ensembl browser, and Integrated Genome Browser are also mentioned. Other topics covered include the Encyclopedia of Life, India Biodiversity, Barcode of Life, data retrieval schemes, bibliographic databases, and database journals.
The Gene Ontology (GO) provides a controlled vocabulary for describing gene and gene product attributes across species. It consists of three ontologies covering biological processes, molecular functions, and cellular components. GO terms are organized into a directed acyclic graph structure and can have relationships like "is_a" and "part_of". Genes are annotated with GO terms to capture functional information, which is shared across species to facilitate research. While useful, the GO has some limitations like unclear reasoning principles and lack of validation procedures.
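The directed-acyclic-graph structure described above means an annotation to a term implies annotation to all of that term's ancestors over "is_a"/"part_of" edges. A toy sketch of that upward propagation (the term identifiers are illustrative, not real GO IDs):

```python
# parent edges of a tiny GO-style DAG; a term may have several parents
parents = {
    "GO:child": {"GO:mid1", "GO:mid2"},
    "GO:mid1": {"GO:root"},
    "GO:mid2": {"GO:root"},
    "GO:root": set(),
}

def ancestors(term):
    """All terms reachable by following parent edges (transitive closure)."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("GO:child")))  # ['GO:mid1', 'GO:mid2', 'GO:root']
```

This closure is what makes GO-based analyses (e.g. enrichment tests) count a gene toward every ancestor of its annotated terms.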
FAIRPORT domain-specific metadata using W3C DCAT & SKOS with ontology views – Tim Clark
FAIRPORT is an international project to develop a lightweight interoperability architecture for biomedical - and potentially other - data repositories.
This slide deck is a presentation to the FAIRPORT technical team. It describes a proposed model for supporting domain-specific search metadata using a common schema model across all repositories.
The proposal makes use of the following existing technologies, with minor extensions:
- the W3C DCAT model for dataset description
- the W3C SKOS knowledge organization system
- the W3C OWL 2 Web Ontology Language
- the Dublin Core vocabulary
- the NCBO BioPortal biomedical ontologies collection
The document discusses two programs - BLASTing AmiGOs and "33" - that were designed to automatically generate Gene Ontology (GO) terms from gene/protein sequences. BLASTing AmiGOs takes FASTA sequences as input and outputs the associated GO terms without manual input. "33" queries a GO database using gene products from another group to retrieve GO terms and evidence codes. Manually collecting the same GO term data for 32 genes took 4-5 hours, while the programs could generate the terms automatically. The document compares the manual and automated methods and discusses using computational tools to help biologists more efficiently organize and access expanding genomic data.
This document discusses biomedical text mining techniques used to extract information from scientific papers. It covers named entity recognition to identify concepts like proteins, chemicals and diseases. It also discusses information extraction to formalize facts stated in text, such as interactions between biological components. Techniques include co-mentioning analysis and natural language processing and tools have been applied to large text corpora to aid discovery.
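The co-mentioning analysis referred to above amounts to counting how often two entities occur in the same document; frequent co-mention suggests, but does not prove, an association. A minimal sketch (entity names are illustrative):

```python
from collections import Counter
from itertools import combinations

def comention_counts(abstract_entities):
    """Count entity pairs co-occurring in the same abstract.

    abstract_entities: a list with one set of recognized entities
    per abstract; pairs are stored in sorted order so (A, B) and
    (B, A) are counted together.
    """
    counts = Counter()
    for entities in abstract_entities:
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts

abstracts = [{"TP53", "MDM2"}, {"TP53", "MDM2", "EGFR"}, {"EGFR"}]
counts = comention_counts(abstracts)
print(counts[("MDM2", "TP53")])  # 2
```

Production systems such as those described here additionally weight counts against the marginal frequency of each entity, so that ubiquitous names do not dominate.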
Cardiac failure, cyanotic heart disease
D. Liver disease – cirrhosis of the liver
E. Inflammatory bowel disease
F. Miscellaneous – cystic fibrosis, rheumatoid arthritis, thyroid disorders, chronic renal failure
Grading of finger clubbing
1. Early clubbing – angle between nailbed and finger < 180 degrees
2. Moderate clubbing – angle between 180 and 200 degrees
3. Marked clubbing – angle > 200 degrees
Clubbing is usually bilateral and symmetric. It is graded by measuring the angle between the nailbed and finger using a protractor.
Presentation at the Sri Lanka College of Venereologists, 2011 – Dr Ajith Karawita
This document summarizes a study that mapped and estimated the sizes of female sex worker (FSW) and men who have sex with men (MSM) populations in Anuradhapura District, Sri Lanka. The study used a geographic mapping methodology involving key informant interviews (Level 1) to identify hot spots, which were then validated (Level 2). Final estimates were 1,138 FSWs and 729 MSM in the district. The study aimed to provide data to help plan HIV prevention programs for these most-at-risk populations.
Piwowar AMIA 2008: Identifying data sharing in biomedical literature – Heather Piwowar
Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.
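The regular-expression component of such a system can be sketched as follows. The patterns below are invented for illustration; the study's actual patterns and its machine-learning classifier are more extensive.

```python
import re

# hypothetical patterns for statements that a dataset was shared
SHARING_PATTERNS = [
    r"deposited in (the )?(GEO|Gene Expression Omnibus|ArrayExpress)",
    r"accession (number|no\.?)\s*(GSE|E-MEXP-)\d+",
    r"data (are|is) (publicly )?available (at|from)",
]

def mentions_data_sharing(fulltext):
    """True if any sharing-declaration pattern matches the article text."""
    return any(re.search(p, fulltext, re.IGNORECASE) for p in SHARING_PATTERNS)

print(mentions_data_sharing(
    "Microarray data were deposited in GEO under accession number GSE1133."
))  # True
```

In practice, pattern matches like these would serve as features for a classifier rather than as a final decision, which is how recall and precision can be traded off as the abstract describes.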
This lecture was delivered at the First Sri Lanka National Consultation Meeting on MSM, HIV and Sexual Health, 18–21 November 2009, organised and conducted by Companions on a Journey and Naz Foundation International.
The document discusses indexing of biomedical literature. It begins with background information on what constitutes an article and the concept of publishing. It then defines what a citation is, including citation contents, styles, and identifiers. It also discusses referencing methods and plagiarism. The document then describes cataloging and indexing, including major indexing services like PubMed and Index Medicus provided by the National Library of Medicine.
Sri Lankan experience on reduction of HIV stigma and discrimination among hea... – Dr Ajith Karawita
This presentation was given at the 11th ICAAP in Satellite Session 08 (Hall G) on Getting to Zero Discrimination in Healthcare Settings in Asia, organized by the International Labour Organization (ILO).
The 11th ICAAP was held at the Queen Sirikit Convention Centre, Bangkok, Thailand, 18–22 November 2013.
This document discusses mining literature and medical records using text mining techniques. It summarizes that text mining can be used to extract relevant information from large collections of scientific papers and medical records by using techniques like named entity recognition to identify concepts, information extraction to formalize stated facts, and analyzing co-mentioning of entities to find relationships. Challenges include the unstructured nature of medical records, differences between languages and formats, and privacy concerns when using patient health information. When applied carefully, text mining of literature and medical records can help identify new relationships and insights not captured in existing curated databases or help with medical research questions.
This document discusses text mining and data integration techniques used to extract information from biomedical literature and databases. It describes named entity recognition to identify concepts, co-mentioning analysis to find associations between entities, and using these methods along with experimental data and predictions to build integrated networks of genes and proteins and their relationships. These networks are made accessible through web resources that unify data from various sources under common identifiers and provide visualization and programmatic access.
This document discusses natural language processing and text mining techniques for biomedical literature and electronic health records. It describes named entity recognition to identify concepts like genes and proteins, relation extraction to find interactions between entities, and information extraction to formalize stated facts. It also discusses integrating extracted information with structured databases and visualizing relationships through web interfaces. Medical text mining can apply these techniques to clinical notes to identify diseases, drugs, adverse events and more for applications like comorbidity analysis, patient stratification, and pharmacovigilance.
Network biology: Large-scale data and text miningLars Juhl Jensen
This document discusses network biology and large-scale data and text mining. It describes how Lars Jensen uses computational predictions from over 1100 genomes along with experimental data and information extracted from text to build protein-protein association networks in STRING. These networks integrate known and predicted protein-protein interactions with functional associations, and are used to study biological systems at the network level.
Systems biology - Understanding biology at the systems levelLars Juhl Jensen
The document discusses systems biology and its goal of understanding biology at the systems level. It explains that systems biology studies complete biological systems by integrating multiple types of high-throughput omics data and mathematical modeling. It provides examples of modeling the cell cycle and integrating gene expression, protein interaction, and genetic interaction networks to understand complex multi-layer regulation within biological systems. Interactive online databases are described that allow users to explore omics data, expand networks, and investigate relationships between biological entities and diseases.
The document discusses the STRING database and related tools for exploring protein-protein association networks, gene neighborhoods, phylogenetic profiles, and other computational predictions and experimental data. It notes that individual databases cover different species and formats, and have variable quality. STRING aims to integrate these resources using common identifiers, quality scores, and text mining while calibrating scores against experimental data and curated knowledge. Resources discussed include STRING for protein networks, STITCH for chemical networks, and COMPARTMENTS and TISSUES for subcellular localization and tissue expression data.
This document discusses large-scale data and text mining techniques used by STRING to build comprehensive protein association networks. STRING integrates information from genomic context, high-throughput experiments, co-expression and curated databases to assign a confidence score to each association. Natural language processing is applied to mine the scientific literature and extract entity and relation information from millions of articles and abstracts to expand the known protein association networks beyond curated knowledge. STRING is freely accessible online and allows users to perform queries and analyze networks for various organisms.
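The integration of evidence channels into one confidence score can be illustrated with a simplified combination rule: assuming the channels err independently, the combined confidence is the probability that at least one channel is correct. This is a sketch of the idea only; the real database additionally corrects each score for the probability of a chance association, which is omitted here.

```python
def combined_score(channel_scores):
    """Combine per-channel confidences (each in [0, 1]) assuming the
    evidence channels err independently: the combined score is the
    probability that at least one channel is correct."""
    p_all_wrong = 1.0
    for s in channel_scores:
        p_all_wrong *= 1.0 - s
    return 1.0 - p_all_wrong

# e.g. experiments 0.6, co-expression 0.4, text mining 0.5:
print(round(combined_score([0.6, 0.4, 0.5]), 2))  # 0.88
```

Note how several weak channels can together yield a high-confidence association, which is the point of integrating heterogeneous evidence.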
The document discusses Lars Juhl Jensen's research using networks of proteins and diseases. His lab uses text mining of biomedical literature, curated databases, and experimental data to build protein-protein interaction networks. These networks are then used to study relationships between proteins, diseases, tissues, and cellular compartments. Jensen's lab has created web interfaces and databases to disseminate the results of their computational predictions and analyses of disease networks. They also use medical data like electronic health records to study relationships between diseases and adverse drug reactions.
This document discusses network biology and text mining of large datasets to analyze protein and medical networks. It describes using techniques like named entity recognition, information extraction, and natural language processing on text corpora with millions of abstracts and articles to identify relationships between genes, proteins, and medical entities. The text also discusses using these methods to analyze protein interaction and medical diagnosis trajectory data to gain biological and medical insights.
Systems biology - Bioinformatics on complete biological systemsLars Juhl Jensen
This document discusses systems biology and bioinformatics. It describes how systems biology takes a holistic approach to study complete biological systems and all of their components and interactions. In contrast, earlier approaches in biology focused on studying one gene or protein at a time. The document outlines several key subfields and approaches within systems biology, including mathematical modeling of biological networks and pathways, data integration from various sources, and the use of association networks to predict functional relationships between biomolecules. It provides examples of publicly available databases like STRING and STITCH that compile interaction and association data from multiple sources for large numbers of organisms. The challenges of data integration are also discussed, arising from issues like incompatible identifiers and variable data quality across sources.
STRING & related databases: Large-scale integration of heterogeneous dataLars Juhl Jensen
The document discusses the STRING database, which integrates heterogeneous biological data to generate association networks for proteins. It describes how STRING collects and connects curated knowledge, experimental data, and predicted interactions from genomic context, co-expression and text mining. The document also outlines exercises for users to explore protein-protein associations in STRING and related databases that integrate data on subcellular localization, tissue expression, and disease associations.
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLars Juhl Jensen
This document discusses using data mining and text mining techniques to link proteins, chemicals, and side effects in molecular interaction networks. It provides examples of using the STRING and STITCH databases to explore protein and chemical networks. It also discusses how text mining of biomedical literature and electronic health records can help identify molecular interactions, adverse drug reactions, and support drug repurposing efforts.
This document discusses large-scale integration of data and text in bioinformatics. It describes using text mining on millions of abstracts and articles to extract information on biological entities and their associations in order to build networks of proteins, genes, diseases and small molecules. This information is integrated with experimental data and computational predictions into web-centric databases and resources that can help researchers by saving them time over manually reviewing the literature. Visualization tools are also provided to project network data onto tissue and subcellular localization information extracted from text.
Systems biology: Bioinformatics on complete biological systemLars Juhl Jensen
Systems biology uses mathematical modeling to study molecular networks and complete biological systems. It requires detailed knowledge of molecular interactions, which can be determined through various high-throughput interaction assays. However, interaction data from different databases may have varying quality and identifiers, so integrating this data requires resolving these issues. Natural language processing of literature can provide additional interaction data by recognizing named entities and extracting relations from text.
The document discusses networks of proteins and diseases. It describes several databases and methods that can be used to integrate information about protein-protein interactions, computational predictions, experimental data, and text mining results to build networks linking proteins and diseases. These include STRING, which contains known and predicted protein interactions, and methods using gene fusion, conserved neighborhood, and co-mentioning in text to predict additional interactions. The document also describes web resources and databases developed by the author's lab to catalog protein localization, tissue expression, and disease associations based on integrating these different data sources and association networks.
Similar to Integration of biomedical literature and databases (20)
One tagger, many uses: Illustrating the power of dictionary-based named entit...Lars Juhl Jensen
This document summarizes a Twitter thread discussing the uses of a dictionary-based named entity recognition tool called Tagger. Tagger can recognize genes, proteins, diseases and other biomedical entities. It is open source, runs quickly processing over 1000 abstracts per second, and achieves 70-80% recall and 80-90% precision. Tagger has been applied to tasks like identifying drug-disease associations, adverse drug events, and protein-protein interactions. It is available as a Docker container or web service.
One tagger, many uses: Simple text-mining strategies for biomedicineLars Juhl Jensen
The document summarizes a text mining tool called a tagger that can be used for named entity recognition in biomedical texts. It recognizes genes, proteins, chemicals, diseases, and other entities. The tagger is open source, runs quickly at over 1000 abstracts per second, and has 70-80% recall and 80-90% precision. It comes with Python and Docker implementations and can be accessed via a web service. It is useful for tasks like extracting functional associations from literature and electronic health records.
This document describes Extract 2.0, a text-mining tool that can assist with interactive annotation of documents. It uses dictionary-based tagging to identify relevant entities like genes and diseases. It achieves 70-80% recall and 80-90% precision on entity extraction and was evaluated in BioCreative challenges where it received positive feedback from curators. The tool is open source and available as a web service or Python wrapper.
Network visualization: A crash course on using CytoscapeLars Juhl Jensen
This document discusses using Cytoscape, a network analysis tool, to import and visualize networks from STRING and STITCH databases. It provides three examples of networks created from literature and disease queries, demonstrating how to import networks and tables, apply node attributes and visual styles, perform enrichment analysis, and more.
STRING & STITCH: Network integration of heterogeneous dataLars Juhl Jensen
The document discusses STRING and STITCH, two online databases that integrate data on protein-protein interactions, pathways, and functional associations from various sources. Together, the two resources cover over 9.6 million proteins and 430 thousand chemicals, drawing on sources such as text mining, experimental assays, and co-expression analyses. STRING aims to provide a comprehensive global view of known and predicted protein associations, while STITCH focuses on chemical-protein interactions. Both databases provide user-friendly web interfaces for browsing and visualizing interaction networks.
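Beyond the web interface, STRING exposes a documented REST API. As a hedged sketch, a request URL for the `network` endpoint can be assembled like this; `identifiers` and `species` are documented API parameters, while the helper name itself is just an illustrative wrapper (see string-db.org/help for the full parameter list).

```python
from urllib.parse import urlencode

def string_network_url(identifiers, species=9606, fmt="tsv"):
    """Build a request URL for STRING's REST 'network' endpoint."""
    query = urlencode({
        "identifiers": "\r".join(identifiers),  # API expects a %0d-separated list
        "species": species,                     # NCBI taxon id; 9606 = human
    })
    return f"https://string-db.org/api/{fmt}/network?{query}"

print(string_network_url(["INSR", "IRS1"]))
```

Fetching that URL returns the known and predicted associations for the query proteins as tab-separated values, ready for scripting.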
Biomedical text mining: Automatic processing of unstructured textLars Juhl Jensen
1) Lars Juhl Jensen discusses biomedical text mining and automatic processing of unstructured text such as patent literature, grant proposals, FDA product labels, and electronic medical records.
2) Named entity recognition is used to identify genes/proteins, chemical compounds, diseases, and other entities in text through comprehensive dictionaries and flexible matching rules that account for variations.
3) Relation extraction uses natural language processing techniques like part-of-speech tagging and sentence parsing along with manually crafted rules and machine learning to identify implicit relations between entities in text such as transcription factor targets, kinase substrates, and protein-protein interactions.
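A minimal illustration of the rule-based side of relation extraction over entity-tagged text: the `<P>…</P>` tagging convention, the verb list, and the single pattern below are simplified assumptions, not the actual pipeline described above.

```python
import re

# Toy rule: after named entity recognition has tagged protein mentions,
# look for an interaction verb between two of them.
INTERACTION_VERBS = r"(?:phosphorylates|binds|activates|inhibits)"
PATTERN = re.compile(
    r"<P>(?P<a>[^<]+)</P>\s+" + INTERACTION_VERBS + r"\s+<P>(?P<b>[^<]+)</P>"
)

def extract_relations(tagged_sentence):
    """Return (subject, object) pairs for matched interaction verbs."""
    return [(m.group("a"), m.group("b")) for m in PATTERN.finditer(tagged_sentence)]

print(extract_relations("<P>Cdc28</P> phosphorylates <P>Swe1</P> in mitosis."))
# [('Cdc28', 'Swe1')]
```

Real systems replace this single regular expression with part-of-speech tagging, sentence parsing, and learned rules, precisely because natural language rarely follows one fixed template.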
Medical network analysis: Linking diseases and genes through data and text mi...Lars Juhl Jensen
The document summarizes the work of Lars Juhl Jensen and others on medical network analysis and linking diseases and genes through data and text mining of electronic health records. It discusses how they have used Danish national health registries containing data on over 6 million patients and 119 million diagnoses over 14 years to study disease trajectories and comorbidities. It also describes how they have developed methods to integrate data from various sources to generate networks linking diseases and genes.
Network Biology: A crash course on STRING and CytoscapeLars Juhl Jensen
This document provides an overview of STRING, a protein-protein association database, and Cytoscape, a network visualization tool. It describes how STRING contains functional associations between proteins derived from genomic context, co-expression and curated databases. Cytoscape can import STRING networks and external data to map onto nodes. It offers visualization of networks through layouts and attributes, and analysis through clustering, selection filters and enrichment. The document recommends using these tools together to explore protein association networks.
This document discusses different approaches to visualizing cellular networks and the molecular interactions between proteins. It notes that there are many different types of data that could be shown, such as protein names, functions, localization, expression, modifications, and interaction types. However, it is impossible to show all this information at once. The document recommends using different visualizations like force-directed layouts to distribute proteins in 2D or lining up interactions in 1D. It acknowledges open challenges like showing time-course data and modification sites. In the end, the document thanks several researchers who have contributed to mapping and visualizing cellular networks.
Cellular Network Biology: Large-scale integration of data and textLars Juhl Jensen
The document discusses various community resources and software tools for integrating large-scale data and text, including STRING for protein networks, STITCH for chemical networks, COMPARTMENTS for subcellular localization, TISSUES for tissue expression, and DISEASES for disease associations. It provides an overview of text mining techniques used to extract information from literature to build networks in these resources. The presenter demonstrates the Cytoscape App which can import and analyze networks from STRING, perform queries, and analyze subcellular localization, tissue expression, and disease enrichment.
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...Lars Juhl Jensen
This document discusses statistical methods for analyzing high-throughput biomedical screens and common pitfalls. It introduces several statistical tests such as t-tests, ANOVA, Fisher's exact test, and the Mann-Whitney U test. It also discusses challenges like multiple testing, resampling techniques, and biases that can occur like studiedness bias and abundance bias in big data analyses. Controlling false discovery rates and considering effect sizes are recommended over solely relying on p-values to determine biological significance.
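Two of the methods mentioned above can be sketched in plain Python (a simplified illustration, not code from the talk): a one-sided Fisher's exact test for a 2x2 table, and Benjamini-Hochberg adjustment for controlling the false discovery rate across many such tests.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    hypergeometric probability of observing at least 'a' by chance."""
    n = a + b + c + d
    return sum(
        comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)
        for k in range(a, min(a + b, a + c) + 1)
    )

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for FDR control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(reversed(order)):
        rank = m - position
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(round(fisher_exact_greater(3, 1, 1, 3), 3))  # 0.243
```

Adjusting p-values across the whole screen, rather than thresholding each raw p-value at 0.05, is exactly the multiple-testing safeguard the talk recommends.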
Tagger: Rapid dictionary-based named entity recognitionLars Juhl Jensen
Tagger is a named entity recognition tool that can process over 1000 abstracts per second using a dictionary-based approach. It achieves 70-80% recall and 80-90% precision using comprehensive dictionaries, expansion rules, and a curated blacklist to identify entity types like genes, proteins, chemicals, and diseases. The tool has a C++ engine, is inherently thread-safe, and includes interactive annotation, Python wrappers, and a REST API.
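A toy version conveys the dictionary-based idea; the real Tagger is a C++ engine with far richer expansion rules and a curated blacklist, and the normalization below is a deliberately minimal assumption.

```python
def normalize(token):
    """Case- and hyphen-insensitive key used for flexible matching."""
    return token.lower().replace("-", "")

def build_dictionary(names):
    """Map normalized keys back to canonical entity names."""
    return {normalize(name): name for name in names}

def tag(text, dictionary):
    """Return the canonical names matched in the text, in order."""
    words = text.replace(",", " ").replace(".", " ").split()
    return [dictionary[normalize(w)] for w in words if normalize(w) in dictionary]

proteins = build_dictionary(["Swe1", "Cdc5", "Cdc28", "Clb2"])
print(tag("Cdc-28 phosphorylated Swe1, priming it for Cdc5.", proteins))
# ['Cdc28', 'Swe1', 'Cdc5']
```

Because each lookup is a constant-time hash probe, this style of matching is what makes throughputs of over 1000 abstracts per second feasible.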
Network Biology: Large-scale integration of data and textLars Juhl Jensen
Lars Juhl Jensen leads a group that conducts large-scale integration of biological and medical data using proteomics, text mining, and medical data mining. The group develops protein interaction networks, disease networks, and association networks. They collaborate internationally on projects involving over 9.6 million proteins and 2000 genomes. The group works to integrate data from many sources in different formats to build comprehensive networks and knowledgebases, and also mines biomedical text to link genes and proteins with diseases.
Medical text mining: Linking diseases, drugs, and adverse reactionsLars Juhl Jensen
This document discusses medical text mining and linking diseases, drugs, and adverse reactions. It describes using text mining on clinical narratives in Danish to recognize named entities like drugs and diseases, identify relationships between them like adverse drug reactions, and discover new ADRs. The goal is to generate structured data on topics like comorbidities, diagnosis trajectories, and reimbursement to supplement limited structured data and help busy doctors by analyzing large amounts of unstructured text.
Network biology: Large-scale integration of data and textLars Juhl Jensen
The document discusses network biology and large-scale data integration. It describes protein-protein interaction networks like STRING that integrate data from curated knowledge, experiments, and predictions. It provides exercises to explore the human insulin receptor (INSR) in STRING, examining the types of evidence that support its interaction with IRS1. It also introduces other integrated networks like STITCH for chemicals and COMPARTMENTS for subcellular localization. Natural language processing techniques like named entity recognition, information extraction, and semantic tagging are used to integrate text data from the literature into these interaction networks.
Medical data and text mining: Linking diseases, drugs, and adverse reactionsLars Juhl Jensen
This document discusses medical data and text mining to link diseases, drugs, and adverse reactions. It describes using structured data from Danish central registries and unstructured data from hospital electronic health records. Named entity recognition is used to extract diseases, drugs, and adverse reactions from free text clinical notes written in Danish. Hand-crafted rules are developed to identify relationships between extracted entities like adverse drug reactions. This allows estimating frequencies of known adverse drug reactions and discovering new adverse drug reactions by analyzing diagnosis trajectories and medication information.
This document discusses cellular network biology and summarizes several key papers on topics like proteome analysis using mass spectrometry, integrating protein network and experimental data, challenges with different biological databases having varying formats and quality, and using natural language processing techniques like named entity recognition and relation extraction to analyze medical text for information like diagnosis trajectories and adverse drug reactions.
Network biology: Large-scale integration of data and textLars Juhl Jensen
This document discusses natural language processing (NLP) techniques for extracting information from biomedical literature and integrating it with network and interaction data. It describes how NLP is used to identify entities like genes and proteins, extract relationships between entities, and integrate this text-mined information with existing interaction networks from databases like STRING to expand knowledge of protein interactions, complexes, pathways and associations with diseases. The document provides examples of using NLP analysis on sentences and the STRING and Tissues databases to explore tissue specificity and disease relationships for insulin and the insulin receptor.
The document discusses three parts of biomarker bioinformatics: data integration from multiple databases, text mining of scientific literature, and using that integrated data to prioritize biomarker candidates. It describes combining data on 9.6 million proteins from curated databases, using text mining to extract named entities from over 10,000 papers, and then using network and heat diffusion approaches to rank candidates based on evidence in the integrated data. The goal is to help identify new biomarker candidates from large amounts of biological data.
The Art of Counting: Scoring and ranking co-occurrences in literatureLars Juhl Jensen
The document discusses methods for scoring and ranking co-occurrences of entities like diseases and genes in literature. It describes counting co-occurrences within different text levels like documents, paragraphs and sentences, and using techniques like z-score transformations and weighted combinations that can rank entities for a given query without changing the overall ranking. The methods have been implemented in web tools that can return results for queries within seconds using preprocessed named entity recognition results stored in a relational database.
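The weighted combination of co-mention counts at different text levels can be sketched as follows; the weights here are illustrative assumptions, not the published values.

```python
# Count a co-mention more heavily the closer together the two names appear.
WEIGHTS = {"sentence": 1.0, "paragraph": 0.5, "document": 0.25}

def cooccurrence_score(counts):
    """counts maps a text level to the number of co-mentions at that level."""
    return sum(WEIGHTS[level] * n for level, n in counts.items())

# A pair co-mentioned in 3 sentences, 5 paragraphs, and 10 documents:
print(cooccurrence_score({"sentence": 3, "paragraph": 5, "document": 10}))  # 8.0
```

Precomputing such counts per entity pair in a relational database is what allows the web tools described above to answer ranking queries in seconds.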
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Securing your Kubernetes cluster: a step-by-step guide to success!KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: Advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training activities. She previously worked on LibreOffice migrations and training courses for various public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (the origin of her nickname, deneb_alpha).
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1, and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation.