A software tool that classifies the mouse and human scientific literature in PubMed into different areas of research, using citation networks and the Medical Subject Headings (MeSH) Thesaurus, to identify and study the popular areas of mouse-human research. It also classifies the proteins in these literature citations into different biological systems, using protein co-occurrence networks and Gene Ontology, to investigate the proteins for which the mouse is used as a model organism for humans.
5. Mouse Models in Research
Inexpensive
Shares 99% of its genome with humans
Fewer ethical concerns than other mammal models
Short generation times
Small
6. The Mouse Trap. The Danger of Using one Lab Animal to Study Every Disease. Daniel Engber
http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html
November 16, 2011
7. Designer Mice for Human Research
Photo taken from “Designer mice for human disease - A close view of Nobel Laureate : Oliver Smithies” Yau-Sheng Tsai, Pei-Jane Tsai, Man-Jin Jiang, Cherng-Shyang Chang. http://proj.ncku.edu.tw/research/commentary/e/20071116/2.html December 9, 2014
8. Mouse Model is Not Perfect Though
Photo taken from: The Mouse Trap. The Danger of Using one Lab Animal to Study Every Disease. Daniel Engber
http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html
November 16, 2011
9. Mouse Correlation with Human to Equivalent Diseases
Photo taken from “Genomic responses in mouse models poorly mimic human inflammatory diseases.” Seok, Warren, and others. Proceedings of the National Academy of Sciences 110, no. 9 (2013): 3507-3512.
[Figure axes: rank correlation (R²); percentage of genes changed in the same direction]
10. Proposed Research
What? Classify the Mouse-Human scientific literature in PubMed into different areas of research
How? Citation Networks + MeSH Thesaurus
Why? Identify and study the popular areas of Mouse-Human research
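The slides do not say how the MeSH Thesaurus is applied to the citation network. One simple possibility, sketched here purely as an assumption, is to assign each paper to research areas by the top-level branch of its MeSH tree numbers and then count papers per area. The function name, the three-entry branch table, and the sample papers are all hypothetical illustrations, not the author's data or method.

```python
from collections import Counter

# Small illustrative subset of top-level MeSH disease branches (assumed lookup
# table; the full MeSH tree is far larger).
MESH_BRANCHES = {
    "C04": "Neoplasms",
    "C10": "Nervous System Diseases",
    "C14": "Cardiovascular Diseases",
}


def classify_paper(tree_numbers):
    """Return the set of branch names matched by the first three characters
    of each MeSH tree number; unknown branches are ignored."""
    return {MESH_BRANCHES[t[:3]] for t in tree_numbers if t[:3] in MESH_BRANCHES}


# Hypothetical papers: PMID -> MeSH tree numbers (made-up sample assignments).
papers = {
    "342342": ["C04.588.180"],
    "423545": ["C14.280.647", "C10.228.140"],
    "432598": ["C04.557.337"],
}

# Count how many papers fall into each research area.
area_counts = Counter(
    area for numbers in papers.values() for area in classify_paper(numbers)
)
```

Using a set per paper means a paper with several tree numbers in the same branch is counted once for that area, which keeps the per-area totals comparable.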
11. Proposed Research
What? Classify the proteins in the Mouse-Human citation pairs into different biological systems
How? Protein Co-occurrence Networks + Gene Ontology
Why? Investigate the biological systems and proteins for which Mouse is used as a model organism for Human
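A protein co-occurrence network can be read as a weighted graph in which two proteins are linked whenever they are annotated to the same paper. The sketch below builds such edge weights with the standard library; the function name and the sample paper-to-protein annotations are hypothetical, and the slides do not confirm that this is how the network was actually constructed.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(paper_proteins):
    """Count, for every unordered protein pair, the number of papers in which
    both proteins appear. Keys are alphabetically sorted tuples."""
    counts = Counter()
    for proteins in paper_proteins.values():
        # De-duplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(proteins)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical annotations: PMID -> proteins mentioned (made-up sample data).
paper_proteins = {
    "342342": ["TP53", "MDM2", "BRCA1"],
    "423545": ["TP53", "MDM2"],
    "432598": ["BRCA1"],
}

edges = cooccurrence_counts(paper_proteins)
```

Each resulting edge weight could then be compared against Gene Ontology annotations of the two proteins, which is the classification step the slide names.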
12. Agenda
1. PubMed Articles Classification
   1. Collect Mouse and Human Papers
   2. Build a Citation Network
   3. Classify the Cit-Net Using MeSH Thesaurus
   4. Stats Study on MeSH Disease Classification
2. PubMed Proteins Analysis
   1. Collect Human Proteins and Annotation Data
   2. Build the Entity Co-occurrence Networks
   3. Classify PCoC Networks Using Gene Ontology
3. Summary
17. Getting Mouse and Human PubMed IDs
UniProt, GOA
Mouse PubMed Identifiers (PMIDs)
Human PubMed Identifiers (PMIDs)
1. Get Mouse & Human papers from UniProt
2. Query PubMed API for the citation list for each article
3. Parse PubMed XML response and get the citation list
...
<CitationList>
<PMID> 342342 </PMID>
<PMID> 423545 </PMID>
<PMID> 432598 </PMID>
</CitationList>
...
Very few PubMed articles have the citation list in their XML file!
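The parsing step can be sketched as follows. This is a minimal illustration, not the tool's actual code: it reads the `<CitationList>` fragment from the slide with Python's standard `xml.etree.ElementTree`. The `<Article>` wrapper and the helper name `extract_cited_pmids` are assumptions, and a real PubMed record may name or nest these elements differently.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modelled on the slide; the <Article> wrapper is an
# assumption and this is not a real PubMed response.
SAMPLE_XML = """
<Article>
  <CitationList>
    <PMID> 342342 </PMID>
    <PMID> 423545 </PMID>
    <PMID> 432598 </PMID>
  </CitationList>
</Article>
"""

def extract_cited_pmids(xml_text):
    """Return the PMIDs found under any <CitationList> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    pmids = []
    for citation_list in root.iter("CitationList"):
        for pmid in citation_list.iter("PMID"):
            # The slide shows padded values, so strip whitespace before converting.
            if pmid.text and pmid.text.strip():
                pmids.append(int(pmid.text.strip()))
    return pmids

cited = extract_cited_pmids(SAMPLE_XML)
# A record without a citation list yields an empty result.
no_citations = extract_cited_pmids("<Article></Article>")
```

A record with no `<CitationList>` element returns an empty list, which is the case the slide reports for most PubMed articles.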
18. Getting Mouse and Human Citation List from Scopus
UniProt, GOA
Mouse PubMed Identifiers (PMIDs)
Human PubMed Identifiers (PMIDs)
1. Get Mouse & Human papers from UniProt
2. Compose an HTTP GET request with the PMIDs
3. Parse Scopus JSON response and get the citation list
...
{"CitationList": [{"PMID": 342342}, {"PMID": 423545}, {"PMID": 432598}]}
...
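A sketch of the JSON parsing step, assuming a well-formed version of the illustrative payload on the slide. The key names `CitationList` and `PMID` come from the slide and are not the real Scopus response schema, which the slides do not show. The helper name and the citing PMID are made up for illustration.

```python
import json

# Assumed, well-formed rendering of the slide's illustrative payload.
SAMPLE_JSON = '{"CitationList": [{"PMID": 342342}, {"PMID": 423545}, {"PMID": 432598}]}'

def cited_pmids_from_json(payload):
    """Return the cited PMIDs from a response shaped like SAMPLE_JSON (hypothetical helper)."""
    data = json.loads(payload)
    return [entry["PMID"] for entry in data.get("CitationList", [])]

# Map a made-up citing PMID to the PMIDs it cites.
citations_by_pmid = {111111: cited_pmids_from_json(SAMPLE_JSON)}
```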
21. Building the Citation Network
[Figure: citation network whose nodes are Human (H) and Mouse (M) papers. Edge types: M → H, H → H, H → M, M → M]
22. Building the Citation Network
[Figure: citation network whose nodes are Human (H) and Mouse (M) papers. Edge types: M → H, H → H, H → M, M → M]
Mouse Inter and Intra Citations (pie chart)
Legend: Mouse-Human Citations, Mouse-Mouse Citations, Mouse-Others Citations
Slice values: 62%, 3%, 34%
Human Inter and Intra Citations (pie chart)
Legend: Human-Others Citations, Human-Human Citations, Human-Mouse Citations
Slice values: 34%, 62%, 4%
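The edge types and citation shares can be computed along these lines. This is a minimal sketch with invented PMIDs and species labels. Treating any paper without a species label as "Other" is an assumption about how the "Others" slices were defined.

```python
from collections import Counter

# Hypothetical species label per PMID ("M" = Mouse, "H" = Human).
species = {1: "M", 2: "H", 3: "H", 4: "M", 5: "H"}

# Hypothetical directed citations: (citing PMID, cited PMID).
citations = [(1, 2), (1, 4), (3, 2), (3, 1), (5, 2), (4, 99)]

def edge_type_counts(citations, species):
    """Count citations by (citing species -> cited species)."""
    counts = Counter()
    for citing, cited in citations:
        source = species.get(citing, "Other")
        target = species.get(cited, "Other")
        counts[f"{source} -> {target}"] += 1
    return counts

def shares(counts, source):
    """Percentage of each edge type among the citations made by `source` papers."""
    own = {k: v for k, v in counts.items() if k.startswith(source + " ")}
    total = sum(own.values())
    if total == 0:
        return {}
    return {k: round(100 * v / total, 1) for k, v in own.items()}

counts = edge_type_counts(citations, species)
human_shares = shares(counts, "H")
```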
25. Medical Subject Headings
Controlled vocabulary to index PubMed articles
Stored in a DAG-like structure
16 top-level concepts at the root
Includes ~27K concepts (MeSH descriptors) altogether
We used MeSH to group the Mouse and Human papers in the citation network into classes of research
26. MeSH Structure Example
Digestive System Diseases
Gastrointestinal Diseases
Digestive System Neoplasms
Neoplasms by Site
Neoplasms
Stomach Diseases
Gastrointestinal Neoplasms
Stomach Neoplasms
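Because MeSH is a DAG, a descriptor can sit under several parents and reach more than one top-level concept. The sketch below walks up from a term to its ancestors. The child-to-parent edges are an assumption based on the MeSH hierarchy for these terms, since the extracted slide lists only the term names. The function names are hypothetical.

```python
# Child -> parents edges for the terms on the slide (assumed, see lead-in).
PARENTS = {
    "Stomach Neoplasms": ["Gastrointestinal Neoplasms", "Stomach Diseases"],
    "Gastrointestinal Neoplasms": ["Digestive System Neoplasms", "Gastrointestinal Diseases"],
    "Stomach Diseases": ["Gastrointestinal Diseases"],
    "Digestive System Neoplasms": ["Neoplasms by Site", "Digestive System Diseases"],
    "Gastrointestinal Diseases": ["Digestive System Diseases"],
    "Neoplasms by Site": ["Neoplasms"],
    "Neoplasms": [],
    "Digestive System Diseases": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def top_level(term, parents=PARENTS):
    """Return the parentless terms that `term` belongs to, sorted by name."""
    candidates = ancestors(term, parents) | {term}
    return sorted(t for t in candidates if not parents.get(t))

roots = top_level("Stomach Neoplasms")
```

In this sketch a paper indexed with "Stomach Neoplasms" falls under two top-level classes, so a classification step has to decide whether to count it in both.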
28. To Do: Place in research areas
[Figure: the Human (H) and Mouse (M) papers of the citation network grouped into disease areas: Digestive System Diseases, Eye Diseases, Virus Diseases, Immune System Diseases, Cardiovascular Diseases, Skin Diseases]
30. Number of Mouse and Human Papers in the MeSH Disease Categories
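A count of this kind can be produced with a simple tally. This is a minimal sketch on invented papers. The input layout (species plus a list of disease areas per PMID) and the choice to count a paper once in every area it belongs to are assumptions.

```python
from collections import defaultdict

# Hypothetical input: PMID -> (species, MeSH disease areas assigned to the paper).
papers = {
    101: ("M", ["Eye Diseases"]),
    102: ("H", ["Eye Diseases", "Virus Diseases"]),
    103: ("H", ["Skin Diseases"]),
    104: ("M", ["Virus Diseases"]),
    105: ("H", ["Virus Diseases"]),
}

def papers_per_area(papers):
    """Return {area: {"M": mouse paper count, "H": human paper count}}."""
    table = defaultdict(lambda: {"M": 0, "H": 0})
    for species, areas in papers.values():
        for area in set(areas):  # count a paper once per area
            table[area][species] += 1
    return dict(table)

area_counts = papers_per_area(papers)
```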
35. GenBank
1. Get the Human protein sequences and papers
Protein: NP_e342 | PMID: 432432
kicgdkssgihygvitcegckgffrrsqqc
Protein: NP_452u1 | PMID: 483232
Adtltytlglsdgqlplgaspdlpeasacp
…..
2. Group the proteins by their PMID
...
PMID: 3213414
NP_u4323: sgihygvitcegckgffrrsqqc
NP_i4322: lplgaspdlpeasacfewrwts
NP_w3421: kicgdkssgihygvitceg
PMID: 2346414
NP_ti3423: vitcegckgckgffrrsqqc
NP_q4322f: ygvitcegeasacfewrwts
NP_x342u2: kicgdkssgihygvitceg
3. Intersect the GenBank papers with Scopus citations
NP_u4323: sgihygvitcegckgffrrsqqc
NP_i4322: lplgaspdlpeasacfewrwts
NP_w3421: kicgdkssgihygvitceg
NP_ti3423: vitcegckgckgffrrsqqc
NP_q4322f: ygvitcegeasacfewrwts
NP_x342u2: kicgdkssgihygvitceg
NP_w3421: kicgdkssgihygvitceg
NP_ti3423: vitcegckgckgffrrsqqc
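The grouping and intersection steps can be sketched as below. The accessions and PMIDs are the slide's placeholder values, the set of cited PMIDs is invented, and both function names are hypothetical.

```python
from collections import defaultdict

# (protein accession, PMID) records in the style of the slide; placeholder values.
records = [
    ("NP_u4323", 3213414),
    ("NP_i4322", 3213414),
    ("NP_w3421", 3213414),
    ("NP_ti3423", 2346414),
    ("NP_q4322f", 2346414),
    ("NP_e342", 432432),
]

# Hypothetical set of Human PMIDs that occur as cited papers in the citation network.
cited_pmids = {3213414, 999999}

def group_by_pmid(records):
    """Collect the protein accessions reported in each paper."""
    groups = defaultdict(list)
    for accession, pmid in records:
        groups[pmid].append(accession)
    return dict(groups)

def restrict_to_cited(groups, cited_pmids):
    """Keep only the papers that also appear in the citation data."""
    return {pmid: prots for pmid, prots in groups.items() if pmid in cited_pmids}

groups = group_by_pmid(records)
cited_groups = restrict_to_cited(groups, cited_pmids)
```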
37. Gene Ontology
Photo taken from: Gene Ontology Consortium. Ontology Structure. http://geneontology.org/page/ontology-structure Last accessed December 13, 2014
38. Gene Ontology Annotation
Example: cytochrome c
Molecular Function: oxidoreductase activity
Biological Process: oxidative phosphorylation
Cellular Component: mitochondrial matrix
40. Getting GO Terms
1. Create BLAST query in FASTA format
NP_u4323: sgihygvitcegckgffrrsqqc
NP_i4322: lplgaspdlpeasacfewrwts
NP_w3421: kicgdkssgihygvitceg
NP_ti3423: vitcegckgckgffrrsqqc
NP_q4322f: ygvitcegeasacfewrwts
NP_x342u2: kicgdkssgihygvitceg
2. Create BLAST Database from Swiss-Prot Human Flat File
3. Do BLAST with e-value = 10^-8
4. Parse the BLAST XML response and get the GO terms for the top hits
NP_u4323: GO1, GO5, GO4
NP_i4322: GO5, GO9
NP_w3421: GO4, GO6
...
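The four steps can be sketched as follows, assuming the NCBI BLAST+ command-line tools (`makeblastdb`, `blastp`); the slides do not say which BLAST distribution was used. The file names are placeholders, the Swiss-Prot sequences are assumed to have been exported to FASTA beforehand, and the hit identifiers and GO labels are invented. The commands are only built here, not executed.

```python
def to_fasta(sequences):
    """Format {accession: sequence} as FASTA text for the BLAST query file."""
    return "".join(f">{acc}\n{seq}\n" for acc, seq in sequences.items())

sequences = {
    "NP_u4323": "sgihygvitcegckgffrrsqqc",
    "NP_i4322": "lplgaspdlpeasacfewrwts",
}
fasta_text = to_fasta(sequences)

# BLAST+ invocations as argument lists (placeholder file names).
makeblastdb_cmd = ["makeblastdb", "-in", "swissprot_human.fasta",
                   "-dbtype", "prot", "-out", "swissprot_human"]
blastp_cmd = ["blastp", "-query", "query.fasta", "-db", "swissprot_human",
              "-evalue", "1e-8", "-outfmt", "5", "-out", "hits.xml"]  # 5 = XML output

def go_terms_for_queries(top_hits, go_annotations):
    """Transfer the GO terms of each query's top hit to the query protein."""
    return {query: go_annotations.get(hit, []) for query, hit in top_hits.items()}

# Invented top hits and annotations, standing in for the parsed BLAST XML.
top_hits = {"NP_u4323": "SP_HIT_A", "NP_i4322": "SP_HIT_B"}
go_annotations = {"SP_HIT_A": ["GO1", "GO5", "GO4"], "SP_HIT_B": ["GO5", "GO9"]}
query_go = go_terms_for_queries(top_hits, go_annotations)
```

Each command list could be passed to `subprocess.run(cmd, check=True)` once the input files exist.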
48. To Do: Place in Protein Biological Systems
[Figure: proteins grouped by GO term: lactase activity, serotonin receptor activity, signal sequence binding, signal transducer activity, nucleotide binding, ATP binding]
50. Summary
Cit-Net connects citing Mouse papers with cited Human papers in the PubMed database
MeSH is used to classify the citation network nodes into different classes of research
PCoC network connects the proteins in the citing Mouse papers with proteins in the cited Human papers
GO is used to group the P-P and P-C-P network nodes into different classes of MFs, BPs and CCs
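The PCoC construction summarized above can be sketched as below, on invented citation pairs and protein names. Weighting each Mouse-Human protein edge by the number of citation pairs that support it is an assumption; the slides do not state how edges are weighted.

```python
from collections import Counter
from itertools import product

# Hypothetical citation pairs: (citing Mouse PMID, cited Human PMID).
citation_pairs = [(10, 20), (10, 21), (11, 20)]

# Hypothetical proteins associated with each paper.
proteins_by_pmid = {
    10: ["mA", "mB"],
    11: ["mA"],
    20: ["hX"],
    21: ["hX", "hY"],
}

def pcoc_edges(citation_pairs, proteins_by_pmid):
    """Link every protein of a citing paper to every protein of the paper it cites."""
    edges = Counter()
    for citing, cited in citation_pairs:
        mouse_proteins = proteins_by_pmid.get(citing, [])
        human_proteins = proteins_by_pmid.get(cited, [])
        for mouse_p, human_p in product(mouse_proteins, human_proteins):
            edges[(mouse_p, human_p)] += 1
    return edges

edges = pcoc_edges(citation_pairs, proteins_by_pmid)
```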
51. Timetable
Months: Jan, Feb, Mar, Apr, May
Tasks:
Database Creation and Data Migration
Citation Network Classification
PCoC Networks Building
PCoC Networks Classification
PCoC Networks Analysis