The biomedical literature captures the most current biomedical knowledge and is a tremendously rich resource for research. With over 24 million publications currently indexed in the US National Library of Medicine’s PubMed index, however, it is becoming increasingly challenging for biomedical researchers to keep up with this literature. Automated strategies for extracting information from it are required. Large-scale processing of the literature enables direct biomedical knowledge discovery. In this presentation, I will introduce the use of text mining techniques to support analysis of biological data sets, and will specifically discuss applications in protein function and phenotype prediction, exploring the integration of literature data with complementary structured resources.
Function and Phenotype Prediction through Data and Knowledge Fusion
1. Function and Phenotype Prediction through Data and Knowledge Fusion
Karin M. Verspoor, The University of Melbourne
karin.verspoor@unimelb.edu.au
27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center
2. We have the blueprints to life, but we don’t know how to read them.
• At least a quarter of protein families in Pfam have no known function (Domains of Unknown Function)
• Millions of proteins uncharacterised
4. What is protein function?
The Gene Ontology (GO) provides a vocabulary
• Captures biological process, molecular function, cellular component
• Common representation for model organism databases to facilitate sharing
7. Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (> 2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
• “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)
8. Text as a primary source of knowledge
Despite ever-increasing structured resources, the literature remains the primary repository of knowledge in biomedicine
[Figure: number of Swiss-Prot proteins (axis 0 to 120,000) from 1/02 to 1/07; series: proteins missing a FUNCTION comment, proteins gaining a FUNCTION comment]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al Bioinformatics (ISMB 2007)
10. Data sources, Data Integration
• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable
• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for Human communication
11. Bio Text Analysis in a nutshell
Input: Documents
• pre-processing (e.g., format conversion)
• tokenization
• sentence detection
• term normalisation (e.g., stem, lemmatise)
• biological named entity recognition
• biological concept recognition
• syntactic analysis
• coordination resolution
• co-reference resolution
• ambiguity resolution
• entity linking
• relation extraction
• event annotation
• reasoning and inference
Domain Knowledge: Terminologies, Ontologies, Known Relationships
Output: Annotated Documents; extracted facts and relationships
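The first stages listed above (sentence detection, tokenization, term normalisation) can be sketched in a few lines. This is a minimal illustration only, not the pipeline used in the talk: the regular expressions are naive assumptions, the function names are hypothetical, and the normalisation step only lowercases rather than stemming or lemmatising.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Keep alphanumeric tokens, allowing internal hyphens (e.g. 'ephrin-A5')."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)

def normalise(token):
    """Rough term normalisation: lowercase only (no stemming or lemmatising)."""
    return token.lower()

def preprocess(text):
    """Return one list of normalised tokens per detected sentence."""
    return [[normalise(t) for t in tokenize(s)] for s in split_sentences(text)]

# Made-up example text for illustration
doc = "MMP14 is a membrane protein. It is a proteolytic enzyme."
sentences = preprocess(doc)
```

Real biomedical text needs more care than this (abbreviations such as "e.g." or "et al." would be split incorrectly), which is why dedicated sentence splitters and tokenizers are normally used.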
12. GO Function Prediction
Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.
13. GOstruct: Structured output SVM
• Cross-species view
– Features: sequence-based features
– Labels: mouse GO annotations, human GO annotations
– Structured SVM training
• Species-specific view
– Features: co-mention features, PPI, gene expression
– Labels: mouse annotations
– Structured SVM training
• Multi-view: $f(x,y) = f^{(c)}(x,y) + f^{(s)}(x,y)$
14. Structured output
• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to root
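One way to picture the two points above is a binary vector with one position per ontology term, where annotating a protein with a term also switches on every ancestor up to the root. The sketch below assumes a tiny hypothetical hierarchy (the `PARENTS` table is illustrative and is not the real GO graph), and the function names are made up for this example.

```python
# Hypothetical toy hierarchy: term -> list of parent terms (a DAG, as in GO).
PARENTS = {
    "root": [],
    "metabolic process": ["root"],
    "catabolic process": ["metabolic process"],
    "proteolysis": ["metabolic process"],
    "localization": ["root"],
}
TERMS = sorted(PARENTS)  # fixed term order defines the vector positions

def ancestors_and_self(term):
    """Collect a term plus everything on its paths up to the root."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def encode(annotations):
    """Encode a set of annotations as a single 0/1 vector over TERMS."""
    active = set()
    for term in annotations:
        active |= ancestors_and_self(term)
    return [1 if t in active else 0 for t in TERMS]

vec = encode({"proteolysis"})
```

With this encoding, two label vectors can be compared position by position, so a prediction that gets the parent right but the leaf wrong is only partly wrong.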
16. Feature integration via kernels
• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions
• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
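A common way to integrate heterogeneous feature sets through kernels is to compute one kernel matrix per feature set and add them. The slides do not state which combination GOstruct uses, so the sketch below is only an assumption for illustration: it uses linear kernels, an unweighted sum, and made-up feature values for three proteins.

```python
def linear_kernel(X):
    """Gram matrix of dot products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def add_kernels(*kernels):
    """Element-wise sum of equally sized kernel matrices."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) for j in range(n)] for i in range(n)]

# Hypothetical feature blocks: one row per protein, one block per data source
ppi = [[1, 0], [1, 1], [0, 1]]             # e.g. interaction indicators
text = [[2, 0, 1], [0, 1, 0], [1, 1, 0]]   # e.g. co-mention counts

K = add_kernels(linear_kernel(ppi), linear_kernel(text))
```

For linear kernels, summing the matrices is equivalent to concatenating the feature blocks; the benefit of working at the kernel level is that each source can use its own kernel and scaling.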
17. Extraction & Analysis pipeline
Christopher Funk (2015) PhD dissertation, U. Colorado Denver
18. Integrating Text
• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence
Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10
19. Text-based features
• Words
– (tokens)
• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)
• Relations
– (simple co-occurrences)
23. Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
24. Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
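The two co-mention types above can be sketched as follows, assuming protein and GO mentions have already been recognised per sentence, and assuming a non-sentence co-mention means the protein and the GO term occur in the same document but in different sentences. The document annotations below are invented for illustration; only the identifiers are taken from the slide.

```python
from collections import Counter
from itertools import product

def co_mentions(sentences):
    """sentences: list of (protein_ids, go_ids) sets, one pair per sentence.
    Returns (sentence co-mention counts, non-sentence co-mention counts)."""
    sent, non_sent = Counter(), Counter()
    for i, (proteins_i, _) in enumerate(sentences):
        for j, (_, go_j) in enumerate(sentences):
            target = sent if i == j else non_sent
            for pair in product(proteins_i, go_j):
                target[pair] += 1
    return sent, non_sent

# Hypothetical document with three sentences.
doc = [
    ({"P50281"}, {"GO:0008237", "GO:0006508"}),
    (set(), {"GO:0010467"}),
    ({"P50281"}, set()),
]
sent, non_sent = co_mentions(doc)
print(dict(sent))
print(dict(non_sent))
```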
25. Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2
26. Feature Representation
Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
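A small sketch of how one such row might be produced from raw counts. The helper names and the token list are assumptions for illustration; the `id, name=count, …` layout follows the rows shown above.

```python
def bag_of_words(tokens):
    """Count how often each token occurs."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def format_row(uniprot_id, counts):
    """Serialise sparse counts as 'ID, name=count, name=count, ...'."""
    fields = [f"{name}={n}" for name, n in sorted(counts.items())]
    return ", ".join([uniprot_id] + fields)

# Hypothetical tokens from text associated with one protein.
tokens = "known membrane protein known proteolytic enzyme".split()
row = format_row("P50281", bag_of_words(tokens))
print(row)
```

The same `format_row` helper works for co-mention features if the dictionary keys are GO identifiers rather than words.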
27. An aside on GO concept recognition
• Given:
– Gene Ontology (~46,000 concepts)
In mice lacking ephrin-A5 function, cell proliferation and
survival of newborn neurons… (PMID 20474079)
• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development
(can be based on a judgment about the depth of
experimental evidence)
28. (CRAFT example)
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
29. GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in
membrane budding…
• …involvement of annexin A7 in
budding of vesicles…
• …Ca2+-mediated vesiculation
process was not impaired.
• Red blood cells which lack the
ability to vesiculate cause…
• Having excluded a direct role
in vesicle formation…
GO vs NL
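The gap between ontology labels and natural language can be illustrated with a simple exact-match lookup over the term name and synonyms listed above. This is only a sketch of naive dictionary matching, not how any of the evaluated tools work; the snippets are shortened from the examples on this slide.

```python
import re

TERM = {
    "id": "GO:0006900",
    "labels": [
        "membrane budding",
        "membrane evagination",
        "nonselective vesicle assembly",
        "vesicle biosynthesis",
        "vesicle formation",
    ],
}

def exact_match(text, term):
    """True if any name or synonym occurs verbatim (case-insensitive) in text."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(label) + r"\b", lowered)
        for label in term["labels"]
    )

snippets = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Red blood cells which lack the ability to vesiculate",
    "Having excluded a direct role in vesicle formation",
]
hits = [exact_match(s, TERM) for s in snippets]
print(hits)
```

Only the first and last snippets match; "budding of vesicles", "vesiculation" and "vesiculate" express the same concept but need word-order flexibility or morphological normalisation to be recognised.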
30. Comparing tool performance on CR
[Figure: Best performance for all tools on all ontologies. Precision/recall plot (both axes 0.0 to 1.0) with F-measure contour lines from f=0.1 to f=0.9. Systems: MetaMap, Concept Mapper, NCBO Annotator. Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI.]
• NCBO Annotator
(96 combinations)
wholeWordOnly, filterNumber,
stopWords, stopWordsCaseSensitive,
minTermSize, withSynonyms
• MetaMap
(864 combinations)
model, gaps, wordOrder,
acronymAbb, derivationalVars,
scoreFilter, minTermSize
• Concept Mapper
(576 combinations)
searchStrategy, caseMatch, stemmer,
orderIndependentLookup,
findAllMatches, stopWords, synonyms
Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.
31. Literature alone is useful
[Figure: bar chart of macro-averaged F-measure (axis 0 to 0.7) by Gene Ontology branch (MF, BP, CC). Series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW.]
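For reference, a minimal sketch of a macro-averaged F-measure, assuming the average is taken over GO terms with equal weight (the slides do not spell out the exact averaging scheme). Protein and term names are placeholders.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f(gold, predicted, terms):
    """Compute F per term over proteins, then take the unweighted mean."""
    scores = []
    for term in terms:
        g = {p for p, ts in gold.items() if term in ts}
        h = {p for p, ts in predicted.items() if term in ts}
        scores.append(f_measure(len(g & h), len(h - g), len(g - h)))
    return sum(scores) / len(scores)

# Hypothetical gold-standard and predicted annotations.
gold = {"p1": {"t1", "t2"}, "p2": {"t1"}, "p3": {"t2"}}
pred = {"p1": {"t1"}, "p2": {"t1", "t2"}, "p3": set()}
score = macro_f(gold, pred, ["t1", "t2"])
print(score)
```

Macro-averaging gives rare terms the same weight as frequent ones, so a method that only predicts common terms well scores poorly.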
32. Literature features approach performance of commonly used biological features
[Figure: bar chart of macro-averaged F-measure (axis 0 to 0.8) by Gene Ontology branch (MF, BP, CC). Series: Trans/Localization, Homology, Network, Literature, All Combined.]
(and combining them with other features is even better!)
33. Manual inspection of misclassifications
Some false positives appear to have literature support:
• GCNT1 – carbohydrate metabolic process
(Q02742 - GO:0005975)
Genes related to carbohydrate metabolism
include PPP1R3C, B3GNT1, and GCNT1…
[PMID:23646466]
• CERS2 – ceramide biosynthetic process
(Q96G23 - GO:0046513)
…CerS2, which uses C22-CoA for
ceramide synthesis…
[PMID:22144673]
46. Conclusions
• The literature provides a significant resource for
biological function prediction
• The literature provides one ‘view’ of biological
knowledge and is best combined with other resources
• Even some simple strategies for extracting
associations from the literature can provide valuable
information, taken at large scale
– “bag of words” and co-occurrence models reasonable
starting point: capture implied relationships
– scope for integration of more targeted extracted
relationships (e.g. protein-protein interactions), with the
usual Precision/Recall tradeoff
47. Acknowledgements
• Los Alamos National Laboratory
– Michael Wall
• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)
• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)
• Funding:
US NIH, US NSF, NICTA, Australian Research Council
49. Machine learning for text analysis
Training set
Notes + labels
for classes of interest
Machine learning
algorithm
Words, Phrases,
Linguistic categories;
names of entities;
Domain concepts;
Document features
Biomedical
knowledge sources
UMLS
OBOs
Language processing
Model
Relating features
of the text to
classes of interest
50. Machine learning for text analysis
New text
to be classified
Words, Phrases,
Linguistic categories;
names of entities;
Domain concepts;
Document features
Biomedical
knowledge sources
UMLS
OBOs
Language processing
Model
Predicted
Classification
(label)
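The train-then-classify workflow on the last two slides can be sketched with a tiny bag-of-words classifier. The example below uses multinomial naive Bayes with add-one smoothing purely as a stand-in learning algorithm; the slides do not name a specific algorithm for this diagram, and the training texts and labels are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns the counts that form the model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: tokenised notes plus labels for classes of interest.
training = [
    ("vesicle budding membrane".split(), "transport"),
    ("membrane vesicle formation".split(), "transport"),
    ("enzyme cleaves protein".split(), "proteolysis"),
    ("proteolytic enzyme activity".split(), "proteolysis"),
]
model = train(training)
label = predict(model, "novel proteolytic enzyme".split())
print(label)
```

In practice the token features would be extended with the entity and concept features produced by the language processing step.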