Presentation for HCSNet Next Generation Workshop 2009
Based on our paper: Quantifying the impact of concept recognition on biomedical information retrieval
IP&M, 2009
This document discusses the challenges of integrating public domain drug discovery data from multiple sources and the mission of the Open PHACTS Foundation to address this issue. It outlines Open PHACTS' approach of integrating biomedical data resources into a single open access point using semantic web technologies and standards. This will allow users to perform queries across multiple data sources and analyze related data in an integrated manner. The document also notes some of the technical challenges around data integration due to differences in formats, identifiers and languages between sources.
Wimmics seminar--drug interaction knowledge base, micropublication, open anno... (by jodischneider)
Presentation to the INRIA WIMMICS research group 2014-10-17 about our LISC paper: Using the micropublication ontology and the Open Annotation Data Model to represent evidence within a drug-drug interaction knowledge base:
http://jodischneider.com/pubs/lisc2014.pdf
http://wimmics.inria.fr/seminars
Strategies for the integration of information (IPI_ConfEX) (by Ben Gardner)
The document discusses approaches to integrating internal and external data across pharmaceutical research. It describes utilizing a data warehousing strategy through a Research Information Factory (RIF) to create a single global repository for research data. However, integrating external data from various sources poses additional challenges. Tools like PharmaMatrix provide a pre-indexed mine of scientific literature linking drug targets to indications, but result sets can be large. The document suggests that Web 2.0 technologies like wikis, blogs and tagging could help turn integrated information into knowledge by enabling collaboration and sharing. Industry-wide data standards and common ontologies would also help facilitate external data integration.
Exploring Chemical and Biological Knowledge Spaces with PubChem (by Paul Thiessen)
My presentation for the Drug Repurposing workshop at the upcoming Bio-IT World Expo.
http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=124256
Presentation abstract:
PubChem has a wealth of chemical structure and biological activity information. In conjunction with NCBI’s other resources such as PubMed and GenBank, PubChem is a vast source of information relevant to repurposing not only of established drugs but any compounds with in vivo pharmacology and/or clinical results. The challenge is how to take advantage of this knowledge. The ability to explore not only chemical similarity but relationships between diseases and disease targets has crucial value in repurposing. While focused investigations are already possible within the existing Entrez system, navigation across these linked information spaces can be difficult to do on a large scale with current tools. We are actively developing new infrastructure to support such analyses, and pursuing new methods of exploring inter- and intra-database relationships between chemicals, targets, diseases, and patents. Progress and some future direction in these areas will be presented.
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
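As a rough illustration of the symmetry property this abstract uses to build its benchmark, here is a minimal sketch with hypothetical identifiers (not actual Bio2RDF data): if cross-references should be symmetric, any link whose reverse is absent is a candidate hidden or erroneous link.

```python
# Minimal sketch (hypothetical identifiers, not actual Bio2RDF data):
# using link symmetry to flag candidate missing cross-references.
links = {
    ("chebi:123", "drugbank:DB999"),   # forward link
    ("drugbank:DB999", "chebi:123"),   # symmetric partner present
    ("kegg:C555", "drugbank:DB999"),   # reverse link not recorded
}

# Every link whose reverse is absent is a candidate hidden/missing link.
candidates = {(b, a) for (a, b) in links} - links
print(candidates)  # {('drugbank:DB999', 'kegg:C555')}
```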
This document discusses data mining of radiology reports to structure unstructured text for further analysis. Over 500,000 de-identified radiology reports containing over 36 million words were annotated by experts to assign sentences to categories called propositions. So far over 427,000 unique sentences have been annotated, representing 60% of total sentences. The structured data is stored in a database and can be analyzed to find frequent findings and compare normal vs. abnormal results. Similar prior works are discussed but the large scale of this dataset and expert validation sets it apart.
The document discusses online resources that can support open drug discovery systems. It outlines how pharmaceutical companies spend billions annually on R&D and how public domain data from sources like literature, patents and databases could provide high value. However, such data is difficult to integrate and navigate due to a lack of standards and interoperability between sources. The Open PHACTS project aims to address this by developing standards to semantically integrate drug discovery data from public and private sources.
Ontology-Driven Clinical Intelligence: Removing Data Barriers for Cross-Disci... (by Remedy Informatics)
The presentation describes how Remedy Informatics is advocating and innovating "flexible standardization" through an ontology-driven approach to clinical research. You will see in greater detail how a foundational, standardized Mosaic Ontology can be extended for more specific research applications and even more specific and focused disease research.
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran... (by Sarvnaz Karimi)
The document presents new approaches for English-Persian transliteration and back-transliteration, including collapsed consonant and vowel models for segmentation and an alignment algorithm. Experimental results on two corpora show the new methods outperform baseline and other state-of-the-art methods, achieving a mean word accuracy of up to 72.2% for English to Persian transliteration and 59.8% for back-transliteration. The new segmentation and alignment techniques generate more accurate transformation rules between the language pairs.
This document describes a user study conducted to understand how medical experts search medical literature. 46 medical experts performed search tasks on 3 different systems using topics provided by a medical library. The systems varied from a basic Boolean system to one incorporating topic modeling. The topic modeling system was found to be the most difficult to use. Experts issued fewer queries for familiar topics and viewed fewer results. The study aims to help improve search tools for medical literature based on expert search behavior.
The document contains 12 monthly calendars for the year 2011, with each calendar spanning from Monday to Sunday over 4 weeks. The calendars were generated online and courtesy of http://www.incompetech.com.
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction (by Sarvnaz Karimi)
This document discusses using automatic transliteration extraction to enrich transliteration lexicons. It proposes a method that uses an existing transliteration generation system to generate possible transliterations for out-of-dictionary words in a source language corpus. These transliterations are then matched to words in a comparable target language corpus to extract potential transliteration pairs. Experimental results on an English-Persian corpus show the method can accurately extract transliteration pairs, and that increasing the size of the initial seed transliteration lexicon improves performance. Further work is needed to incorporate named entity recognition.
We introduce CADEminer, a system that mines consumer reviews on medications in order to facilitate discovery of drug side effects that may not have been identified in clinical trials. CADEminer utilises search and natural language processing techniques to (a) extract mentions of side effects, and other relevant concepts such as drug names and diseases in reviews; (b) normalise the extracted mentions to their unified representation in ontologies such as SNOMED CT and MedDRA; (c) identify relationships between extracted concepts, such as a drug caused a side effect; (d) search in authoritative lists of known drug side effects to identify whether or not the extracted side effects are new and therefore require further investigation; and finally (e) provide statistics and visualisation of the data.
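The (a)-(e) pipeline above can be pictured with a toy sketch; the lexicons and function below are invented for illustration and are not CADEminer's actual components.

```python
# Toy sketch of the (a)-(e) pipeline described above; the lexicons are
# invented, not CADEminer's actual resources.
MENTION_LEXICON = {"dizzy": "Dizziness", "headache": "Headache"}  # (a)+(b)
KNOWN_ADES = {"drugX": {"Headache"}}  # (d) hypothetical authoritative list

def mine(review: str, drug: str):
    for surface, normalised in MENTION_LEXICON.items():
        if surface in review.lower():                      # (a) extract
            known = normalised in KNOWN_ADES.get(drug, set())
            # (c) relate drug to side effect; (d) check if it is known
            yield drug, normalised, "known" if known else "candidate new"

print(list(mine("Felt dizzy after taking drugX", "drugX")))
# [('drugX', 'Dizziness', 'candidate new')]
```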
http://dl.acm.org/citation.cfm?id=2810143
Slides contain images that have been taken from the web.
This Peruvian legislative decree establishes a reward for citizens who provide information leading to the capture of members of criminal or terrorist organizations. Two evaluation commissions are created to grant the rewards, and protection measures are established for cooperating citizens. Funding comes from the budgets of the Ministries of Defense and the Interior.
Ajid abdulmazid 250 phs a escc ccb (investasi) (by rimmyzia)
The document is an illustration of the benefits of PRUlink assurance account insurance for AJID ABDULMAZID, aged 26. The illustration explains a 10-year premium payment plan, the insurance benefits offered such as a hospitalization allowance, a total and permanent disability benefit, additional benefits for critical conditions, and the cash value that may be received at a given age.
This document provides tips for maintaining security and privacy online. It advises not sharing personal information or passwords with others, avoiding opening spam emails, and installing antivirus software to prevent viruses from downloading unauthorized internet programs.
ADCS 2014 Presentation for the paper: http://dl.acm.org/citation.cfm?id=2682868
"Extracting the geographical location that a tweet is about is crucial for many important applications ranging from disaster management to recommendation systems. We address the problem of finding the locational focus of tweets that is geographically identifiable on a map. Because of the short, noisy nature of tweets and inherent ambiguity of locations, tweet text alone cannot provide sufficient information for disambiguating the location mentions and inferring the actual location focus being referred to in a tweet. Therefore, we present a novel algorithm that identifies all location mentions from three information sources---tweet text, hashtags, and user profile---and then uses a gazetteer database to infer the most probable locational focus of a tweet. Our novel algorithm has the ability to infer a locational focus that may not be explicitly mentioned in the tweet and determine its most appropriate granularity, e.g., city or country."
1) 92% of recruiters use or plan to use social media for recruiting, with Facebook and LinkedIn being the most popular.
2) Most recruiters saw positive impacts from social recruiting including an increase in candidate quantity (49%) and quality (43%), and more employee referrals (31%).
3) Over 70% of recruiters consider themselves at least moderately skilled at social recruiting and have successfully hired candidates through social networks.
The document discusses HIV/AIDS and sexually transmitted infections (STIs). In brief, it explains what HIV and AIDS are, the symptoms from the early stage through to full-blown AIDS, how HIV and STIs are transmitted, and the health consequences of HIV and STIs.
This document discusses the different types of pollution including air, water, soil, noise, light, thermal and radioactive pollution. It focuses on air pollution as the world's worst problem caused by sulfur, nitrogen and carbon compounds. Water pollution poses health concerns while soil pollution has effects on plants and wildlife as seen in the BP oil spill. Solutions proposed include using oil eating bacteria.
Produced by the OhMyWeb agency (http://www.ohmyweb.fr/) for a guest lecture at the Ecole Supérieure de Commerce de Pau.
You can also find us on Facebook, Twitter, Google+ and read our blog: http://blog.ohmyweb.fr/
Contents of the presentation:
Introduction to e-marketing
Web marketing and its new professions
Traffic sources
Audience measurement
The document discusses the exponential growth of biomedical research data and literature. It describes challenges researchers face in keeping up with the vast amount of information. Text mining techniques can help by automatically extracting relevant information and facts from literature and organizing them into structured knowledgebases. Named entity recognition is an important text mining task that involves identifying mentions of biomedical entities in text. Both rule-based and machine learning approaches have been used for named entity recognition.
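For instance, a rule-based recognizer can be as simple as a surface pattern; the regular expression below is an invented toy, not a production biomedical NER rule.

```python
# Toy rule-based NER: one invented surface pattern for gene-like symbols
# such as "DRD4" or "nur-77"; real systems combine many rules and
# dictionaries, or machine-learned models.
import re

GENE_LIKE = re.compile(r"\b[A-Za-z]{2,5}-?\d+\b")
text = "The role of DRD4 and nur-77 in alcoholism is debated."
print(GENE_LIKE.findall(text))  # ['DRD4', 'nur-77']
```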
The document discusses various methodologies for extracting information from biological literature, including entity recognition to identify genes/proteins mentioned in text, relationship extraction using co-occurrence and natural language processing techniques, and text categorization to identify specific relationship types. It provides examples of applying these methods to extract entity and relationship information from a sample sentence.
This document provides guidance on developing effective search strategies for systematic reviews. It emphasizes that comprehensive searches are needed but must balance precision and recall. It recommends using filters and hedges to improve precision while maximizing recall, as well as conducting multiple searches tailored to the questions. Pilot searches should be done to refine strategies. The document also stresses considering publication bias and searching additional sources to mitigate it.
TIGA: Target Illumination GWAS Analytics (by Jeremy Yang)
Aggregating and assessing experimental evidence for interpretable, explainable, accountable gene-trait associations. Presentation for NIH IDG Annual Meeting, Feb 9-11, 2021.
Text Mining for Biocuration of Bacterial Infectious Diseases (by Dan Sullivan, Ph.D.)
Specialty gene sets, such as virulence factors and antibiotic resistance genes, are of particular interest to infectious disease researchers. Much of the information about specialty genes’ function is described in literature but unavailable as structured data in bioinformatics databases. The steadily increasing volume of literature makes it difficult to manually find relevant papers and extract assertion sentences about specialty genes. This presentation describes efforts to build an automatic classifier for such sentences. Experiments were conducted to assess the impact of the imbalance of positive and negative examples in source documents on classification; develop a support vector machine (SVM) classifier using term frequency-inverse document frequency (TF-IDF) representation of text; and assess the marginal benefit of additional training examples on the quality of the classifier. Analysis of learning curves indicates that additional training examples will not likely improve the quality of the classifier. We discuss options for other text representation schemes to investigate in order to improve the quality of the classifier as measured by F-score.
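A minimal sketch of a TF-IDF + linear SVM sentence classifier of the kind described, with invented toy sentences and labels (and, purely for illustration, evaluation on the training data itself):

```python
# Minimal sketch (toy sentences and labels, not the presentation's data):
# TF-IDF features feeding a linear SVM, evaluated with F-score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "Gene vfgA confers resistance to ampicillin.",      # assertion
    "Deletion of vfgA attenuates virulence in mice.",   # assertion
    "Samples were incubated overnight at 37 C.",        # not an assertion
    "DNA was extracted using a standard kit.",          # not an assertion
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(sentences, labels)            # illustrative: train set == test set
print("F-score:", f1_score(labels, clf.predict(sentences)))
```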
This document discusses representing the provenance of microarray gene expression experiments in RDF. It outlines how microarray experiments work and the importance of provenance data for understanding results. It then describes a bottom-up approach to developing a provenance model for microarray data, including four types of provenance - institutional, experimental context, data analysis, and dataset descriptions. An example gene list and queries are provided. The model represents provenance as RDF to enable SPARQL querying. Future work includes integrating provenance from scientific workflows and spreadsheet results.
Biological literature mining - from information retrieval to biological disco... (by Lars Juhl Jensen)
14th International Conference on Intelligent Systems for Molecular Biology, Tutorial, Fortaleza Conference Center, Fortaleza, Brazil, August 6-10, 2006
Visual Analytics and the Language of Web Query Logs - A Terminology Perspective (by Findwise)
This paper explores means to integrate natural language processing methods for terminology and entity identification in medical web session logs with visual analytics techniques. The aim of the study is to examine whether the vocabulary used in queries posted to a Swedish regional health web site can be assessed in a way that will enable a terminologist or medical data analyst to instantly identify new term candidates and their relations based on significant co-occurrence patterns. We provide an example application in order to illustrate how the co-occurrence relationships between medical and general entities occurring in such logs can be visualized, accessed and explored. To enable a visual exploration of the generated co-occurrence graphs, we employ a general-purpose social network analysis tool (http://visone.info) that can visualize and analyze various types of graph structures. Our examples show that visual analytics based on co-occurrence analysis provides insights into the use of layman language in relation to established (professional) terminologies, which may help terminologists decide which terms to include in future terminologies. Increased understanding of the querying language used is also of interest in the context of public health web sites: the query results should reflect the intentions of the information seekers, who may express themselves in layman language that differs from the language used on the available web sites provided by medical professionals.
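The co-occurrence statistic behind such graphs can be computed very simply; a minimal sketch over made-up Swedish-style query-log lines (toy data, not the study's logs):

```python
# Minimal sketch (made-up query-log lines): counting within-query term
# co-occurrences, the raw statistic behind the visualized graphs.
from collections import Counter
from itertools import combinations

queries = ["ont i magen barn", "feber barn", "ont i magen feber"]
pairs = Counter()
for q in queries:
    terms = sorted(set(q.split()))          # distinct terms per query
    pairs.update(combinations(terms, 2))    # all unordered term pairs

for (a, b), n in pairs.most_common(3):
    print(f"{a} -- {b}: {n}")
```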
The document describes a framework for biological relation extraction using biomedical ontologies and text mining. It introduces biomedical text mining and outlines the problem, motivation, and challenges. It then presents the overall system components and architecture, including searching/browsing, Swanson's algorithm, protein-protein interactions, and gene clustering applications. The framework's concept issues, design issues, sequence diagram, and database are also covered at a high level.
This document discusses tools for integrating text, data, and various types of evidence to build association networks among proteins, chemicals, and other molecules. It focuses on STRING and STITCH, which aggregate data from curated databases, text mining, and predictions to assign confidence scores and build interactive networks showing functional associations. The exercises guide exploring these resources to learn more about the human thymidylate synthase protein and its interactions.
Semantic Web Technologies as a Framework for Clinical Informatics (by Chimezie Ogbuji)
1. The document discusses using semantic web technologies like RDF and SPARQL to store and query patient data from electronic health records to enable cohort identification and clinical research.
2. Each patient record is stored as a named graph in an RDF dataset that can then be queried using SPARQL (a minimal sketch of this pattern follows after this list). This allows parallel processing and optimized queries that search within patient graphs.
3. Challenges include limitations of SPARQL for representing complex queries and a lack of standard medical ontologies, leading the authors to develop their own patient record ontology.
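A minimal sketch of the named-graph pattern using rdflib, with a hypothetical vocabulary (http://example.org/ terms invented here, not the authors' patient record ontology):

```python
# Minimal sketch (hypothetical vocabulary, not the authors' ontology):
# one named graph per patient record, queried with SPARQL.
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()
record = ds.graph(URIRef("http://example.org/patient/1"))  # named graph
record.add((EX.patient1, EX.hasDiagnosis, Literal("type 2 diabetes")))

# Cohort identification: find patient graphs matching a criterion.
query = """
SELECT ?g WHERE {
  GRAPH ?g { ?p <http://example.org/hasDiagnosis> "type 2 diabetes" }
}
"""
for row in ds.query(query):
    print(row.g)  # http://example.org/patient/1
```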
Short tutorials on how to use the web-based tool DAVID (Database for Annotation, Visualization and Integrated Discovery): http://david.abcc.ncifcrf.gov/
DAVID provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.
This document discusses using natural language processing (NLP) techniques to extract biological information from literature to help interpret large genomics datasets. The author describes developing a method to identify gene regulatory interactions by parsing Medline abstracts. This information can then be combined with data from experiments to classify protein associations and interactions. While literature provides important context, it should not be used alone. The author also intends to apply these NLP methods to full text articles to extract information from different sections like introductions and discussions.
Richard Resnick, speaking at II-SDV, discussed the challenges of keyword and biological sequence searching in the life sciences. Integrating the two search types can provide more relevant results from broad patent authority coverage. However, keyword searching in the life sciences is challenging due to issues like spelling variations and domain-specific terminology. Sequence searching also presents challenges regarding alignment and interpretation of results. Building reports from different search platforms is made more difficult by inconsistent output formats and a lack of cross-platform integration. An integrated life science search platform could provide a complete report for analysis by combining text and sequence search results into a single workfile.
Ontologies for Semantic Normalization of Immunological Data (by Yannick Pouliot)
This document discusses using ontologies to semantically normalize immunological data from the Human Immune Profiling Consortium (HIPC). 57 ontologies covering domains like anatomy, disease, and pathways were evaluated. Text from HIPC datasets and protocols was annotated using these ontologies, with the NCI Thesaurus, Medical Subject Headings, and Gene Ontology mapping to the most terms. Many failures were due to missing commercial reagent terms. The conclusion is that ImmPort, the HIPC data repository, could adopt ontology-based encoding, given additions to the ontologies and text pre-processing.
The TDR Targets Database, an introduction (by tdrtargets)
The TDR Targets Database is an online resource that integrates genomic information relevant for drug discovery on pathogens that cause human diseases. It covers organisms like Plasmodium falciparum (malaria), Mycobacterium tuberculosis (tuberculosis), Trypanosoma brucei (African trypanosomiasis), and others. The database allows users to search for drug targets in these pathogens based on criteria like functional category, phylogenetic distribution, and essentiality. It provides detailed information on resulting target genes, including pathways, structural models, orthologs, essentiality data, druggability, and associated compounds.
This document describes PhenDisco, a rule-based natural language processing (NLP) system for standardizing phenotype variables in the database of Genotypes and Phenotypes (dbGaP). The system uses a pipeline that tags variables with topics and subjects, normalizes descriptions, assigns semantic roles, categorizes variables, and maps them to standardized ontologies. It achieved an accuracy of 73% for topic identification and 71% for categorization. PhenDisco provides improved search and access to dbGaP compared to the existing dbGaP Entrez system, demonstrating higher recall and precision. Future work will integrate machine learning and identify similar variables to further enhance phenotype discovery in dbGaP.
This document provides an overview of meta-analysis methodology and guidelines. It discusses 10 rules for conducting meta-analyses, including specifying the topic, following reporting guidelines, establishing inclusion criteria, conducting a systematic search, contacting authors for missing data, selecting statistical models, using software, being transparent in reporting, providing enough data in manuscripts, and discussing findings and future directions. It also lists several software programs that can be used to perform meta-analyses and statistical analyses. Finally, it provides examples of highly cited meta-analyses conducted by the author.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (by Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
State of Artificial Intelligence Report 2023 (by kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Learn SQL from basic queries to advanced queries (by manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
End-to-end pipeline agility - Berlin Buzzwords 2024 (by Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
1. Sarvnaz Karimi
BioTALA group, NICTA Victoria Research Laboratory
Tagging and Genomics Information Retrieval
2. Genomics Information Retrieval
TREC Genomics defined a text retrieval task in which a user seeks information in a sub-area of biology linked with genomics information.
3. Task Definition
TREC 2006 and 2007 Genomics: passage retrieval.
The collection was full-text articles (12 GB) from 49 medical journals.
Queries were medical questions.
Evaluation was at the passage, document, and aspect levels.
4. Queries
Questions were based on the Generic Topic Types (GTTs).
2006: Five different GTTs, such as “Find articles describing the role of a gene involved in a given disease.”
e.g. What is the role of DRD4 in alcoholism?
2007: Questions asking for a specific entity type based on controlled terminologies (14 types).
e.g. What [DISEASES] are associated with lysosomal abnormalities in the nervous system?
Relevance Judgements: Human experts identified the relevant passages and the relevant concepts.
5. Can Tagging Entity Types Help Retrieval?
Tagging: manual or automatic association of a tag from a controlled vocabulary with a term or phrase in a text.
Controlled vocabulary: the set of entity types as defined for the TREC tasks.
e.g. What is the role of DRD4 in alcoholism?
becomes: What is the role of DRD4 [GENE] in alcoholism [DISEASE]?
Possible benefits:
disambiguation, e.g. nur-77 can refer to a gene or a protein;
increasing the chance of retrieving a document by increasing the terms it has in common with the query.
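A minimal sketch of such tagging (a toy dictionary tagger with invented lexicon entries, not the tagger evaluated in this talk):

```python
# Toy dictionary tagger, not the talk's actual tagger: append an
# entity-type tag after each term found in a small controlled vocabulary.
LEXICON = {"drd4": "GENE", "alcoholism": "DISEASE"}  # hypothetical entries

def tag(text: str) -> str:
    out = []
    for token in text.split():
        out.append(token)
        etype = LEXICON.get(token.lower().strip("?,."))
        if etype:
            out.append(f"[{etype}]")
    return " ".join(out)

print(tag("What is the role of DRD4 in alcoholism?"))
# What is the role of DRD4 [GENE] in alcoholism? [DISEASE]
```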
6. Inside the Text Collection
Distribution of tag terms (entity types) in the documents:
A large majority of documents contain more than one distinct tag (57.4% for 2006 and 94.6% for 2007).
For the 2006 collection, on average 46.5% of the irrelevant documents contain the tag terms, compared to 63.4% of the relevant documents. For 2007, these numbers are 47.8% and 80.3%, respectively.
The proportion of relevant documents that would not be retrieved at all without tagging is small: only 0.4% of relevant documents for the 2006 collection, and only 0.005% for 2007.
7. Inside the Text Collection: Conclusion 1
• Tag terms occur somewhat more frequently in relevant documents than in irrelevant documents (so little extra information or disambiguation is expected from tagging).
• Without annotation, nearly all relevant documents would still be retrievable because of other term overlap between the queries and documents.
8. Inside the Text Collection: IR Flavoured
[Figure: recall versus documents sorted by descending frequency of the tag words.]
9. Inside the Text Collection: Conclusion 2
• A small correlation (ρ = 0.09) between the number of distinct tags and the likelihood that a document is relevant.
• Tag frequency appears to be related to relevance (ρ = 0.84) for most tags.
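The ρ values above are rank correlations; a minimal sketch of how such a statistic is computed, on made-up per-document numbers rather than the TREC judgements:

```python
# Minimal sketch (made-up numbers, not the TREC judgements): a rank
# correlation like the rho values quoted on this slide.
from scipy.stats import spearmanr

distinct_tags = [1, 3, 2, 5, 4, 1, 2, 6]   # per-document distinct tags
relevant      = [0, 1, 0, 1, 1, 0, 0, 1]   # binary relevance judgement

rho, p = spearmanr(distinct_tags, relevant)
print(f"rho = {rho:.2f} (p = {p:.3f})")
```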
11. What's happening?
• An example query:
– Q1. What serum [PROTEINS] change expression in association with high disease activity in lupus? (original query)
– Q2. What serum change expression in association with high disease activity in lupus
– Q3. proteins
– Q4. lupus
13. Conclusions
Does tagging help a text retrieval task? We still do not have strong evidence that it does.
Maybe the tags are too general to be discriminative enough.
What level of accuracy should a tagger have to be beneficial? Within the framework described, we could not define a threshold: even random assignment of tags improved MAP over the baseline.
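Since the final claim is stated in terms of MAP, here is a minimal sketch of mean average precision over toy ranked judgements (invented numbers, not the TREC runs):

```python
# Minimal sketch (toy judgements, not the TREC runs): mean average
# precision (MAP), the metric the conclusion refers to.
def average_precision(ranked_rel, num_relevant):
    """ranked_rel: 0/1 relevance in rank order;
    num_relevant: total relevant documents for the query."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            ap += hits / rank          # precision at each relevant rank
    return ap / num_relevant if num_relevant else 0.0

runs = [([1, 0, 1, 0], 2), ([0, 1, 1, 1], 3)]  # (judgements, #relevant)
map_score = sum(average_precision(r, n) for r, n in runs) / len(runs)
print(f"MAP = {map_score:.3f}")  # 0.736
```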